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Abstract 


Consistent distribution of single-source data to replicated comput- 
ing channels is a fundamental problem in fault-tolerant system design. 
The “Oral Messages” (OM) algorithm of Lamport, Shostak, and Pease 
solves this problem of Interactive Consistency (also known as Source 
Congruence or Byzantine Agreement) in the presence of m arbitrary 
(i.e., Byzantine) Faults, using m + 1 rounds of message exchange and 
n > 3m channels. 

A deficiency of OM and similar algorithms is that all faults are as- 
sumed to exhibit worst-case (i.e., arbitrary) behavior, so that the algo- 
rithm can tolerate no more “simple” faults than truly Byzantine ones. To 
overcome this deficiency, Thambidurai and Park introduced a “hybrid” 
fault model that distinguished three fault modes: asymmetric (Byzan- 
tine), symmetric, and benign; they also exhibited, along with an infor- 
mal “proof of correctness,” a modified version of OM that withstands 
a asymmetric, s symmetric, and b benign faults simultaneously, using 
m - f 1 rounds, provided n > 2a + 2s + b + m, and m > a. 

Unfortunately, this algorithm is flawed; it fails, for example, in the 
case n = 5, m = 1 when the transmitter has a benign fault and one of the 
receivers is Byzantine. We detected this flaw while undertaking a for- 
mal verification of the algorithm using our PVS mechanical verification 
system. Repairing this algorithm is not easy. We developed an incorrect 
version ourselves, and even “proved” it correct using ordinary, informal 
mathematics. 

The discipline of mechanically checked formal verification eventually 
enabled us to develop a correct algorithm for Interactive Consistency 
under the hybrid fault model. We present this algorithm, discuss its 
subtle points, and describe its formal specification and verification. Be- 
cause informal proofs seem unreliable in this domain, and because the 
consequences of failure could be catastrophic, we believe formal verifica- 
tion should become standard for algorithms intended for safety-critical 
applications. We argue that formal verification systems such as PVS are 
now sufficiently effective that their application to such problems may be 
considered routine. 

Keywords: interactive consistency, Byzantine agreement, hybrid fault 

models, formal verification. 
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Chapter 1 

Introduction 


Fault-tolerant systems are designed and evaluated against explicit assumptions re- 
garding the kinds and numbers of faults they are to tolerate. “Fault models,” which 
enumerate the assumed behaviors of faulty components, range from those that iden- 
tify many highly specific modes of failure, to those that comprise just a few broad 
classes. The advantage of a very detailed fault model is that the mechanisms of fault 
tolerance can be finely tuned to deliver maximum resilience from a given level of re- 
dundancy; the corresponding disadvantages are that an overlooked fault mode may 
cause unexpected failure in operation, and the need to counter many fault modes 
can lead to a complex design — which may itself be a source of faults. 

In contrast to designs that consider many fault modes are those that make no 
assumptions whatsoever about the behavior of faulty components. The advantage 
of such Byzantine fault-tolerant designs is that they cannot be defeated by un- 
expected failure modes; their disadvantage is that all faults are treated as “worst 
case,” so that large levels of redundancy tolerate relatively few faults. For example, 
a conventional Byzantine fault-tolerant architecture requires 3m + 1 channels to tol- 
erate m simultaneous faults of any kind within some of its functions [25,1]. Thus, a 
4-plex is needed in order to withstand a single fault, 1 and 5- and 6-plexes provide no 
additional benefit (in fact the additional channels will increase the fault arrival rate 
and thereby lower overall reliability). 2 This seems counterintuitive, since it is clear 
that suitably organized 5- and 6-plexes can withstand more faults, of some kinds, 
than a 4-plex. 

These observations motivate the study of fault-tolerant architectures and al- 
gorithms with respect to hybrid fault models that include the Byzantine, or “ar- 
bitrary, fault mode, together with a limited number of additional fault modes. 

1 Single fault tolerance can also be provided by architectures such as Draper Laboratory’s FTP, 
which uses only three full processors plus three simpler “interstages” [13]. 

2 If the system can be reconfigured following a fault, then a fifth and sixth channel can increase 
reliability if used as standby spares — but they serve no purpose as live channels. 
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Inclusion of the arbitrary fault mode (i.e., faults whose behaviors are entirely un- 
constrained) eliminates the fear that some unforeseen mode may defeat the fault- 
tolerance mechanisms provided, while inclusion of other fault modes allows greater 
resilience to be achieved for faults of these kinds than with a classical Byzantine 
fault-tolerant architecture. 

Our interest is architectures for digital flight-control systems, where fault- 
masking behavior is required to ultra-high levels of reliability. This means that 
not only must stochastic modeling show that adequate numbers and kinds of faults 
are masked to satisfy the mission requirements, but that convincing analytical ev- 
idence must attest to the soundness of the overall fault-tolerant architecture and 
to the correctness of the design and implementation of its mechanisms of fault 
tolerance. 3 Such a rational design for a “reliable computing platform” suitable for 
ultra-critical applications was established in the late 1970s and early 1980s by the 
SIFT architecture [36]: the system workload is executed by several independent 
channels operating in approximate synchrony, and results are subjected to majority 
voting. If all channels execute identical workloads on identical data, then majority 
voting is sufficient to mask arbitrary channel failures. However, majority voting 
is not sufficient to mask arbitrary failures in the distribution of single-source data 
(such as sensor samples) [25], nor in clock synchronization [16]. 

In this report, we focus on algorithms for reliably distributing single-source data 
to multiple channels in the presence of faults. This problem, known as Interac- 
tive Consistency” (although sometimes called “source congruence”), was first posed 
and solved for the case where faulty channels can exhibit arbitrary behavior by 
Pease, Shostak, and Lamport [25] in 1980. 4 Interactive Consistency is a symmet- 
ric problem: it is assumed that each channel has a “private value 1 (e.g., a set of 
sensor samples) and the goal is to ensure that every nonfaulty channel achieves 
an accurate record of the private value of every other nonfaulty channel. In 1982, 
Lamport, Shostak, and Pease [17] presented an asymmetric version of Interactive 
Consistency, which they called the “Byzantine Generals Problem ; here, the goal is 
to communicate a single value from a designated channel called the Commanding 
General” to all the other channels, which are known as “Lieutenant Generals.” The 
problem of real practical interest is Interactive Consistency, but the metaphor of 
the Byzantine Generals has proved so memorable that this formulation is better 
known; it can also be easier to describe algorithms informally using the Byzantine 
Generals formulation, although the balance of advantage can be reversed in truly 
formal presentations [27]. An algorithm for the Byzantine Generals problem can be 
converted to one for Interactive Consistency by simply iterating it over all channels 

3 There are examples where unanticipated behaviors of the mechanisms for fault tolerance became 

the primary source of system failure [18]. 

4 Davies and Wakerley had anticipated some of the issues a few years earlier [9]. 
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(each channel in turn taking the role of the Commander), so there is no disadvan- 
tage to considering the Byzantine Generals formulation. See [27] for more extended 
discussion of this topic. 

Lamport, Pease, and Shostak presented algorithms for solving the Byzantine 
Generals problem. The principal difficulty to be overcome in such algorithms is 
possibly asymmetric behavior on the part of faulty channels: such a channel may 
provide one value to a second channel, but a different value to a third, thereby mak- 
ing it difficult for the recipients to agree on a common value. Byzantine Generals 
algorithms overcome the possibility of faulty channels exhibiting asymmetric behav- 
ior by using several rounds of message exchange during which channel p tells channel 
q what value it received from channel r and so on. The precise form of the algorithm 
depends on assumptions about what a faulty channel may do when relaying such a 
message; under the “Oral Messages” assumption, there is no guarantee that a faulty 
channel will relay messages correctly. This corresponds to totally arbitrary behavior 
by faulty channels: not only can a faulty channel provide inconsistent data initially, 
but it can also relay data inconsistently. 5 

Using m + 1 rounds of message exchanges, the Oral Messages algorithm of Lam- 
port, Shostak, and Pease [17], which we denote OM(m), can withstand up to m 
arbitrary faults, provided n, the number of channels, satisfies n > 3m. The bound 
n > 3 m is optimal: Pease, Shostak, and Lamport proved that no algorithm based on 
the Oral Messages assumptions can withstand more arbitrary faults than this [25]. 
However, as we have already noted, OM(m) is not optimal when other than ar- 
bitrary faults are considered: other algorithms can withstand greater numbers of 
simpler faults for a given number of channels than OM(m). 

We are not the first to make these observations. Thambidurai and Park [33] and 
Meyer and Pradhan [20,21] have considered Interactive Consistency algorithms that 
resist multiple fault classes. Thambidurai and Park’s “Unified” model divides faults 
into three classes: nonmalicious (or benign ), symmetric malicious, and asymmetric 
malicious . A nonmalicious fault is one that produces detectably missing values (e.g., 
timing, omission, or crash faults), or that produces a “self-incriminating” value that 
all nonfaulty recipients can detect as bad (e.g., it fails checksum or format tests). A 
malicious fault is one that yields a value that is not detectably bad (i.e., it is a wrong , 
rather than a missing or manifestly corrupted, value). A symmetric malicious fault 
delivers the same wrong value to every nonfaulty receiver; an asymmetric malicious 
fault delivers (possibly) different wrong values (or missing or detectably bad values) 
to different nonfaulty receivers. The classical arbitrary or Byzantine fault is an 
asymmetric malicious fault in this classification. Note that a nonmalicious fault 
may be asymmetric in that different detectably bad values may be sent, but if a 

5 Under the “signed messages” assumption (which can be satisfied using digital signatures), an 
altered message can be detected by the recipient. 
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good value is sent to some receivers and manifestly bad values are sent to all others, 
then the sender is considered asymmetric malicious in this taxonomy. 

Thambidurai and Park present a variant on the classical Oral Messages algorithm 
that retains the effectiveness of that algorithm with respect to arbitrary faults, but 
that is also capable of withstanding more faults of the other kinds considered. 
In a later paper, Thambidurai, Park, and Trivedi [34] present reliability analyses 
that show this increased fault tolerance indeed provides superior reliability under 
plausible assumptions. McElvany-Hugue has also studied the reliability of related 
algorithms under this fault model, reaching similar conclusions [14]. 

Unfortunately, Thambidurai and Park’s algorithm (which they call Algorithm 
Z) has a serious flaw and fails in quite simple circumstances. In this report, we 
describe the flaw, and explain how straightforward attempts to repair it also fail. 
We then present a correct algorithm for the problem of Interactive Consistency 
under a hybrid fault model and present a proof of its correctness. Thambidurai and 
Park presented a proof of correctness for their flawed algorithm, and we have also 
developed some rather convincing “proofs” ourselves for other incorrect algorithms 
for this problem. Accordingly, we have developed a mechanically checked formal 
verification for our algorithm using the PVS verification system [22]. (In fact, all this 
work grew out of the attempt to formally verify their Algorithm Z.) We describe this 
formal verification and show that it is not particularly difficult. Because informal 
proofs seem unreliable in this domain, and the consequences of failure could be 
catastrophic, we argue that formal verification should become routine. 


1.1 Related Work 

Hybrid fault models were first introduced for the MAFT architecture, developed 
at Allied-Signal’s Aerospace Technology Center [15]. MAFT provides a rich set 
of fault- tolerant mechanisms and services, including clock synchronization (both 
steady-state and startup), interactive consistency with both exact and approxi- 
mate agreement, and sophisticated error detection and reconfiguration [35]. The 
interactive-consistency algorithm employed in MAFT apparently does not suffer 
from the problem we identified in Algorithm Z, and Michelle McElvany Hugue and 
others at Allied Signal have developed corrected versions of Algorithm Z and proofs 
of correctness that are similar to ours. 

The reason we selected hybrid fault models and their algorithms for study is that 
these seem among the most interesting and useful directions in current research for 

^ Meyer and Pradhan [21] consider 21 fault model that, in Thambidurai and Park s taxonomy, 
comprises only asymmetric malicious and benign faults. Their algorithm is derived from the algo- 
rithm of [11] and, like the parent algorithm, is not particularly suitable for the cases of practical 
interest (i.e., m = 1, or possibly m = 2, n less than 10). 
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fault-tolerant architectures. However, our interest is less in the algorithms them- 
selves, and more in the use of mechanically-checked formal methods as a systematic 
and rigorous means to analyze these algorithms, to identify all the assumptions on 
which they depend, to detect and help correct errors in their formulation, and to 
provide compelling arguments for their correctness. From this point of view, algo- 
rithms for hybrid fault models are particularly interesting because of their subtlety, 
and the extended case analysis required in their analysis. As with other algorithms 
where we have discovered errors in published analyses [28], our corrections are tech- 
nical adjustments (though quite radical in the case of the algorithm considered here) 
that build on the insights, algorithms, and analyses of the original authors. 



Chapter 2 


Requirements, Assumptions, 
and the Algorithms OM and Z 


Although the problem of real practical interest is Interactive Consistency, all the 
algorithms we consider are presented here in their Byzantine Generals formulation, 
since this appears simpler in informal presentations. The relationship between In- 
teractive Consistency and the Byzantine Generals Problem is examined in [27]. 


2.1 Requirements 

In the Byzantine Generals formulation of the problem, there are n participants, 
which we call “processors.” A distinguished processor, which we call the transmitter , 
possesses a value to be communicated to all the other processors, which we call the 
receivers. 1 There are n processors in total, of which some (possibly including the 
transmitter) may be faulty. The transmitter’s value is denoted v and the problem 
is to devise an algorithm that will allow each receiver p to compute an estimate v p 
of the transmitter’s value satisfying the following conditions: 

BG1: If receivers p and q are nonfaulty, then they agree on the value ascribed to 
the transmitter — that is, for all nonfaulty p and q, v p = v q . 

BG2: If the transmitter is nonfaulty, then every nonfaulty receiver computes the 
correct value — that is, for all nonfaulty p, v p — v. 

Conditions BG1 and BG2 are sometimes known as “Agreement” and “Validity,” 
respectively. 

‘Lamport, Shostak, and Pease [17] speak of a “Commanding General” and of “Lieutenant Gen- 
erals” where we say transmitter and receivers. 
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2*2 Assumptions 

The principal difficulty that must be overcome by a Byzantine Generals algorithm 
is that the transmitter may send different values to different receivers, thereby com- 
plicating satisfaction of condition BG1. To overcome this, algorithms use several 
“rounds” of message exchange during which processor p tells processor q what value 
it received from processor r and so on. Under the “Oral Messages” assumptions, 
the difficulty is compounded because a faulty processor q may “lie” to processor 
r about the value it received from processor p . More precisely, the Oral Messages 
assumptions are the following. 

Al: Every message that is sent between nonfaulty processors is correctly delivered. 
A2: The receiver of a message knows who sent it. 

A3: The absence of a message can be detected. 

In the classical Byzantine Generals problem, there are no constraints at all on 
the behavior of a faulty processor. 


2.3 Algorithm OM 

Lamport, Shostak, and Pease’s Algorithm OM solves the Byzantine Generals prob- 
lem under the Oral Messages assumption. The algorithm is parameterized by m, the 
number of rounds of message exchanges performed. OM(m) can withstand up to m 
faults, provided n > 3m, where n is the total number of processors. The algorithm 
is described recursively; the base case is OM(O). 

OM(O) 

1. The transmitter sends its value to every receiver. 

2. Each receiver uses the value obtained from the transmitter, or some 
arbitrary, but fixed, value if nothing is received. 

Now we can describe the general case. 

OM(m), m > 0 

1. The transmitter sends its value to every receiver. 

2. For each p, let v p be the value receiver p obtains from the transmit- 
ter, or else be some arbitrary, but fixed, value if it obtains no value. 

Each receiver p acts as the transmitter in Algorithm OM(m — 1) to 
communicate its value v p to each of the n — 2 other receivers. 
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3. For each p, and each q # p, let v q be the value receiver p ob- 
tained from receiver q in step (2) (using Algorithm OM(m — 1)), or 
else some arbitrary, but fixed, value if nothing was received. Each 
receiver p calculates the majority value among all values v q it re- 
ceives, and uses that as the transmitter’s value (or some arbitrary, 
but fixed, value if no absolute majority exists). 

The correctness of this algorithm (that it achieves BG1 and BG2 under certain 
assumptions) was proven in [17, page 390] and mechanically checked in [1,27]. 


2.4 Algorithm Z 

Thambidurai and Park’s Algorithm Z is a modification of OM intended to operate 
under their hybrid fault model described earlier. The difference between OM and Z 
is that the latter has a distinguished “error” value, E. Any processor that receives a 
missing or manifestly bad value replaces that value by E and uses E as the value that 
it passes on in the recursive instances of the algorithm. The majority voting that 
is required in OM, is replaced in Z by a majority vote with all E values eliminated. 
Thambidurai and Park claim that an m-round implementation of Algorithm Z can 
withstand a + s + b simultaneous faults, where a is the number of asymmetric 
malicious faults, s the number of symmetric malicious faults, and b the number of 
nonmalicious faults, provided a < m, and n, the number of processors, satisfies 
n > 2a + 2s + b + m. In the case of no symmetric malicious or nonmalicious faults 
(i.e., Byzantine faults only), we have m = o and s = b = 0, so that n > 3m and the 
algorithm provides the same performance as the classical Oral Messages algorithm. 

We and our colleagues at SRI have undertaken mechanically checked formal ver- 
ifications for a number of fault- tolerant algorithms, including OM [27], and have 
identified deficiencies in some of the previously published analyses (though not in 
the algorithms— see, for example [24,29,30]). Any changes to the established algo- 
rithms for Interactive Consistency must be subjected to intense scrutiny, for errors 
in these algorithms are single points of failure in any system that employs them. 
Changes that widen the classification of faults considered are likely to increase the 
case analysis, and hence the complexity and potential fallibility of arguments for 
the correctness of modified algorithms. We therefore considered Thambidurai and 
Park’s Algorithm Z an interesting candidate for formal verification. 

We began our attempt to formally verify Algorithm Z by studying the proof 
of its correctness provided by Thambidurai and Park [33, pages 96 and 97]. This 
proof follows the outline of the standard proof for OM [17, page 390] quite closely. 
However, we soon found that Thambidurai and Park’s proof is simultaneously more 
complicated than necessary and flawed in several details. The most serious fault 
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is that their Lemma 1 (all nonfaulty receivers get the correct value of a nonfaulty 
transmitter) fails to consider the case where the value sent by the transmitter is E. 
This can arise in recursive instances of the algorithm when nonfaulty receivers are 
passing on the value received from a faulty source. Further thought soon reveals 
that not only is the proof flawed, but the algorithm is incorrect: even systems with 
large numbers of processors may fail with only two faulty components. 

The simplest counterexample comprises five processors in which the transmitter 
has a nonmalicious fault, one of the receivers has an asymmetric malicious fault, 
and the algorithm is Z with one round (i.e., n = 5, a = 1, s = 0,6 = l,m = 1). 
All the nonfaulty receivers note E as the value received from the transmitter, and 
relay the value E to all the other receivers. The faulty receiver sends a different 
(non-E) value to each of the nonfaulty receivers. Each nonfaulty receiver then has 
three E values, and one non-E value; because E values are discarded in the majority 
vote, each nonfaulty receiver selects the value received from the faulty receiver as 
the value sent by the transmitter. Since these values are all different, the algorithm 
has failed to achieve agreement among the nonfaulty receivers. Observe that this 
scenario is independent of the number of receivers (provided there are more than 
three of them — two that should agree and one that is faulty), so the problem is not 
due to an inadequate level of redundancy. 



Chapter 3 

The Algorithm OMH 


OMH is our new algorithm for Interactive Consistency under a hybrid fault model. 
In this chapter, we present the fault model, the algorithm, and its correctness prop- 
erties semiformally; the mechanically-checked formal specification and verification 
are described in the next chapter. 


3.1 Hybrid Fault Model 

Our fault model is that of Thambidurai and Park, but with the cases renamed— we 
find the anthropomorphism in terms such as “malicious faults” unhelpful. 

The fault modes we distinguish for processors are arbitrary-faulty, symmetric- 
faulty , and manifest-faulty (also called crash- faulty). (These correspond to Tham- 
bidurai and Park’s asymmetric malicious, symmetric malicious, and nonmalicious 
faults, respectively.) Of course, we also need a class of good (also called nonfaulty) 
processors. We specify these fault modes semiformally as follows (the formal char- 
acterizations are presented in the following chapter). 

When a transmitter sends its value v to the receivers, the value obtained by a 
nonfaulty receiver p is: 

• v, if the transmitter is nonfaulty 

• E, if the transmitter is manifest-faulty 1 

• Unknown, if the transmitter is symmetric-faulty, but all receivers obtain the 
same value, 

’Some preprocessing of timeouts, parity and “reasonableness” checks, etc. may be necessary 
to identify manifestly faulty values. The intended interpretation is that the receiver detects the 
incoming value as missing or bad, and then replaces it by the distinguished value E. 
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• Completely unconstrained, if the transmitter is arbitrary-faulty. 

Note that it is not necessary to define the value received by a faulty receiver, 
because such receivers may send values completely unrelated to their inputs. Also 
note that manifest faults must be symmetric. If a processor were to “crash” during 
this protocol (or if some of its outgoing finks are broken, or if it is early or late 
transmitting on some finks), it would have to be counted as arbitrary-faulty, since 
different good receivers may obtain different values — even though the values sent 
are either correct or identifiably bad. It is possible to treat such cases as a new class 
of faults, which, depending on practical considerations, may be an interesting area 
for future research (see Section 3.6). 


3.2 Repairing Algorithm Z 

It seems that the flaw in Algorithm Z stems from the fact that it does not distinguish 
between values received from manifest-faulty processors and the report of such values 
received from nonfaulty processors; the single value E is used for both cases. Thus, 
a plausible repair for Algorithm Z introduces an additional distinguished value RE 
(for Reported Error); when a manifestly faulty value is received, the receiver notes it 
as E, but passes it on as RE; if an RE is received, it is noted and passed on as such. 
Only E values are discarded when the majority vote is taken. In the counterexample 
to Algorithm Z given above, the nonfaulty receivers in this modified algorithm will 
each interpret the value received from the transmitter as E, and pass it on to the 
other receivers as RE, In their majority votes, each nonfaulty receiver has a single 
E (from the transmitter), which it discards, two REs (from the other nonfaulty 
receivers), and an arbitrary value (from the faulty receiver). All will therefore select 
RE as the value ascribed to the transmitter. 

Unfortunately this modified algorithm has two defects. First, a receiver that 
obtains a manifest-faulty value from the transmitter notes it as E, but passes it on 
as RE. Thus, this receiver will omit the value from its majority vote, but the others 
will include it (as RE). This asymmetry can be exploited by an arbitrary-faulty 
transmitter to force the receivers into disagreement (consider an arbitrary-faulty 
transmitter and three nonfaulty receivers, where the transmitter sends the values 
E, RE, and a normal value). 

It therefore seems that receivers must distinguish between an E received from 
the transmitter (which must be treated locally as RE and passed on as such), and 
one received from another receiver (which can be discarded in the majority vote). 
This repair fixes one problem, but leaves the other: the value ascribed to a manifest 
faulty transmitter is not E, but RE. This might seem a small inconvenience, but 
it causes the algorithm to fail when m, the number of rounds, is greater than 1 
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(consider the case n — 6, m — 2 when there is a nonfaulty transmitter and three 
manifest-faulty receivers). 

A repair to this difficulty might be to return the value E whenever the majority 
vote yields the value RE. This modification has the problem that receivers cannot 
distinguish a manifest-faulty receiver from a nonfaulty one reporting that another is 
manifest-faulty (consider the case n = 4, m = 1, all the processors are nonfaulty, and 
the transmitter is trying to send RE — as can arise in recursive cases when m > 1). 

Like Thambidurai and Park did for Algorithm Z, we produced rather convincing, 
but nonetheless flawed, informal “proofs of correctness” for these erroneous repairs 
to Algorithm Z. Eventually, the discipline of formal verification (where one must 
deal with the implacable skepticism of a mechanical proof checker and is eventually 
forced to confront overlooked cases and unstated assumptions) enabled us to develop 
a genuinely correct algorithm for this problem. 

Our new algorithm, OMH (for “Oral Messages, Hybrid”), is somewhat related 
to the last of the modifications to Algorithm Z indicated above, but recognizes that 
a single “reported error” value is insufficient. OMH employs two functions R and 
UnR that act as a “wrapper” and an “unwrapper” for error values. 

The basic idea of OMH is that at each round, the processors do not forward 
the actual value they received. Instead, each processor sends a value corresponding 
to the statement “I’m reporting value." One can imagine that after several rounds, 
messages corresponding to “I’m reporting that he’s reporting that she’s reporting an 
Error value” arise. This wrapper is only required for error values, but for simplicity 
we assume that the functions R and UnR are applied to all values. Alternatives to 
this are explored in Section 3.7. This leaves the following intuitive picture of the 
algorithm. 

Proceed as in the usual OM Byzantine agreement algorithm presented above, 
with the following exceptions. Add a distinguished error value E , and two functions 
on values R and UnR. When a manifestly bad value is received, temporarily record 
it as the special value E. 

When passing along a value received from the transmitter or incorporating it into 
the local majority vote, apply R, standing for “I report. . .” to the value. Discard 
all E values (received from other receivers) before voting, but treat all other error 
values (R(E), R(R(E)), etc.) as normal, potentially valid values during voting. 
After voting, apply UnR (strip off one R) before returning the value. 

The key idea here is that in Z and related algorithms there is a confusion about 
which processors have manifest faults: if there is only one error value, E, how 
can a processor distinguish between a manifest-faulty receiver and a good receiver 
reporting a bad value (or the lack of a value) from a manifest-faulty transmitter? The 
counterexample to Algorithm Z given above exploits this confusion, but it is handled 
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correctly by OMH, because the nonfaulty receivers in OMH(l) each receive a single 
E from the transmitter, which they pass on to the other receivers and themselves as 
R(E). The values thus voted on include three R(E ) s and an arbitrary value (from 
the arbitrary-faulty receiver). All nonfaulty receivers therefore select R(E) as the 
majority value. After stripping one R from this value, the result correctly identifies 
the transmitter as manifest-faulty. In short, OMH incorporates the diagnosis of 
manifest faults into the agreement algorithm. 

The Hybrid Oral Messages Algorithm OMH(m) is defined more formally below, 
and completely formally in Chapter 4.1: 

OMH(O) 

1. The transmitter sends its value to every receiver. 

2. Each receiver uses the value received from the transmitter, or uses 
the value E if a missing or manifestly erroneous value is received. 

OMH(m), m > 0 

1. The transmitter sends its value to every receiver. 

2. For each p, let v p be the value receiver p obtains from the transmit- 
ter, or E if no value, or a manifestly bad value, is received. 

Each receiver p acts as the transmitter in Algorithm OMH(m-l) to 
communicate the value R(v p ) to all of the n — 1 receivers, including 
itself. 

3. For each p and q , let v q be the value receiver p received from re- 
ceiver q in step (2) (using Algorithm OMH(m — 1)), or else E if no 
such value, or a manifestly bad value, was received. Each receiver 
p calculates the majority value among all non-error values v q re- 
ceived, (if no majority exists, the receiver uses some arbitrary, but 
functionally determined value) and then applies UnR to that value, 
using the result as the transmitter’s value. 


3.3 Semiformal Analysis and Correctness Arguments 


We make explicit a few unsurprising technical assumptions: 

• All processors are either nonfaulty, arbitrary-faulty, symmetric-faulty, or 
manifest-faulty. (Any fault not otherwise classified is considered arbitrary.) 
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• Processors do not change fault status during the procedure; for example, if 
a nonfaulty processor were to become manifest-faulty during this procedure, 
we would say that processor is arbitrary-faulty because it has effectively sent 
different values to other processors. 

• For all values v, R(v) / E. (Wrapped values are never mistaken for errors.) 

• For all values t?, UnR(R(v)) = v. (Unwrapping a wrapped value results in the 
original value.) 

Algorithm OMH must satisfy the Byzantine Generals conditions naturally ex- 
tended to the fault model described above. 

When the transmitter is symmetric-faulty, it is convenient to call the unique 
value received by all nonfaulty receivers the value actually sent by the transmitter. 

BGH1: If processors p and q are nonfaulty, then they agree on the value ascribed 
to the transmitter; that is, v v — v q . 

BGH2: If processor p is nonfaulty, the value ascribed to the transmitter by p is 

• The correct value v, if the transmitter is nonfaulty, 

• The value actually sent, if the transmitter is symmetric-faulty, 

• The value E, if the transmitter is manifest-faulty. 

The argument for the correctness of OMH is an adaptation of that for the Byzan- 
tine Generals formulation of OM [17, page 390]. We define 

• n, the number of processors, 

• a, the maximum number of arbitrary-faulty processors the algorithm is to 
tolerate, 

• s, the maximum number of symmetric-faulty processors the algorithm is to 
tolerate, 

• c, the maximum number of manifest-faulty processors the algorithm is to 
tolerate, 2 

• m, the number of rounds the algorithm is to perform. 

2 We cannot use m for the number of manifest-faulty processors, because the parameter m is 
traditionally used for the number of rounds (although Thambidurai and Park use r). The symbol 
c can be considered a mnemonic for “crashed,” which is one of the failures that can generate 
manifest-faulty behavior. 
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Lemma 1 For any a, s, c and m, Algorithm OMH(m) satisfies BGH2 if there are 
more than 2 (a + s) + c + m processors. 

Proof: The proof is by induction on m. BGH2 specifies only what must happen 
if the transmitter is not arbitrary-faulty. In the base case to = 0, a nonfaulty receiver 
obtains the transmitter’s value if the transmitter is nonfaulty. If the transmitter is 
symmetric-faulty the value obtained is the value actually sent. If the transmitter is 
manifest-faulty the receiver obtains the value E. So the trivial algorithm OMH(O) 
works as advertised and the lemma is true for m = 0. We now assume the lemma 
is true for m - 1 (m > 0), and prove it for m. 

In step (1) of the algorithm, the transmitter effectively sends some value v to all 
n - 1 receivers. If the transmitter is nonfaulty, v will be v, the correct value; if it is 
symmetric-faulty, u is the value actually sent; if it is manifest-faulty, u is E. In any 
case, we want all the nonfaulty receivers to decide on v. 

In step (2), each receiver applies OMH(m - 1) with n - 1 participants. Those 
receivers that are nonfaulty will apply the algorithm to the value R(o). Since by 
hypothesis n > 2(a + s) + c + m, we have n - 1 > 2(a + s) + c + (m - 1), so we 
can apply the induction hypothesis to conclude that the nonfaulty receiver p gets 
v g = R{v) for each nonfaulty receiver q. Let c' denote the number of manifest-faulty 
processors among the receivers. At most (a + s -f c') of the n - 1 receivers are 
faulty, so each nonfaulty receiver p obtains a minimum of n — 1 — (a + s + c') values 
equal to R(u). Since there are c' manifest-faulty processors among the receivers, a 
nonfaulty receiver p also obtains a minimum of c' values equal to E and, therefore, 
at most n — 1 — c values different from E. The value .ft(t') will therefore win the 
hybrid-majority vote performed by each nonfaulty processor p. provided 

2(n — 1 — (a + s -f- c ')) > n - 1 — c', 

that is, provided 

n 2(a -f- s) T c -}- 1, 

Now, c' < c, and 1 < to, so this condition is ensured by the constraint 
n > 2(a + s) + c + m. 

Finally, UnR is applied to the result R(o), which results in final value u. □ 
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Theorem 1 For any m, Algorithm OMH(m) satisfies conditions BGH1 and BGH2 
if there are more than 2(a + s) + c + m processors and m>a. 

Proof: The proof is by induction on m. In the base case m = 0 there can be no 
arbitrary-faulty processors, since m > a. If there are no arbitrary- faulty processors 
then the previous lemma ensures that OMH(O) satisfies BGH1 and BGH2. We 
therefore assume that the theorem is true for OMH(m- 1) and prove it for OMH(tti), 
m > 0. 

We next consider the case in which the transmitter is not arbitrary-faulty. Then 
BGH2 is ensured by Lemma 1, and BGH1 follows from BGH2. 

Now consider the case where the transmitter is arbitrary-faulty. There are at 
most a arbitrary-faulty processors, and the transmitter is one of them, so at most 
a — 1 of the receivers are arbitrary-faulty. Since there are more than 2(a + s) + c + m 
processors, there are more than 2(a + s) + c + m— 1 receivers, and 

2(o -M) + c + m — 1> 2([a — 1] + 5) + c + [m — 1]. 

We may therefore apply the induction hypothesis to conclude that OMH(m - 1) 
satisfies conditions BGH1 and BGH2. Hence, for each q, any two nonfaulty re- 
ceivers get the same value for v q in step (3). (This follows from BGH2 if one of the 
two receivers is processor q, and from BGH1 otherwise). Hence, any two nonfaulty 
receivers get the same vector of values t?j, . . ., all d therefore obtain the same 

value hybrid-majority {v u ... ,v n - 1 ) in step (3) (since this value is functionally de- 
termined), thereby proving BGH1. □ 


3.4 Extreme Cases 

Although a major improvement on OM, the number of faults that can be tolerated 
by OMH according to the analysis given above is not optimal in some of the extreme 
circumstances. In some cases, the algorithm is suboptimal; in others, the general 
analysis given above is too conservative. As an example of the latter, consider the 
case where only manifest faults are present. In this case, the general analysis above 
indicates that the number of manifest faults that can be tolerated is n — m — 1. 
in other words, the greater the number of rounds, the fewer manifest faults that 
can be tolerated. In fact, alternative analysis shows that OMH(m) tolerates the 
maximum possible number of manifest-faulty processors when there are no arbitrary 
nor symmetric faults. The only constraint is that there must be more processors 
(whether faulty or not) than rounds (since otherwise some recursive instances would 
be run on the empty set of processors). 
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Theorem 2 If arbitrary and symmetric faults are not present, Algorithm OMH(m) 
satisfies conditions BGHl and BGH2 provided there are more than m processors . ^ 

This theorem has been formalized and mechanically verified. The formal proof 
follows that of Theorem 1 closely, using analogous lemmas. However, here there 
are only two cases to consider (good and manifest) whereas there are four in the 
previous argument (good, manifest, symmetric, and arbitrary). 

When only symmetric faults are present, it is the algorithm, rather than its 
general analysis, that is less than optimal. Here, the additional rounds of message 
exchanges are actively counterproductive in the cases m> 0 (compare n = 4, s = 2 
for the cases m = 0 and m = 1). Additional rounds of messages are the price paid 
for overcoming arbitrary faults, and these seem to reduce the ability to deal with 
symmetric faults. An interesting topic for future research is to investigate whether 
this trade-off can be mitigated. 

When no manifest faults are present, Algorithm OMH becomes similar to the 
traditional Algorithm OM. A related point was made in [33]: in the absence of 
error values, hybrid majority is equivalent to majority. Thus the only substantive 
difference between OMH and OM are the wrapper and unwrapper functions applied 
to values. As discussed in Section 3.7 these functions may be identity on nonerror 
values, in which case OMH becomes exactly OM in the absence of manifest errors. 
Thus the analysis above may be applied, showing that the traditional algorithm 
OM(m) can withstand more faults than is suggested by its standard analysis: in 
fact, OM(m) satisfies conditions BGHl and BGH2 if there are more than 2(a+s)-|-m 
processors, m > a, and manifest faults are counted as symmetric. 


3.5 Benefits 


Recall that OM achieves agreement and validity if there are more than three times 
as many good processors as arbitrary-faulty processors (n > 3a). From the bounds 
given in Theorem 1 , n > 2(a + s) + c + m and m > a, it may be seen that OMH 
achieves the same resilience to arbitrary faults if there are no symmetric-faulty or 
manifest-faulty processors. Also, from Theorem 2, if a = s = 0, then all that is 
required is that n > r. 

While providing the same resilience to arbitrary or Byzantine faults, OMH 
achieves a higher degree of tolerance to other classes of faults than OM. Table 3.1 
indicates the different numbers of simultaneous faults that a 6-plex can withstand 
using OMH(l); for comparison, observe that OM(l) can withstand only a single 


3 Of course, the conditions are somewhat vacuous if there are less than two good processors. 
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Number of Faults 

Arbitrary (a) 

Symmetric ( s ) 

Manifest (c) 

i 

i 

0 

1 

0 

2 

0 

2 

0 

0 

1 

2 

0 

0 

5 


Table 3.1: Fault-Masking Capabilities of 0MH(1) with n - 6 

(arbitrary) fault in this configuration. 4 Thambidurai, Park, and Trivedi [34] present 
reliability analyses that show this increased fault tolerance indeed provides superior 
reliability under plausible assumptions 5 . McElvany-Hugue has also studied the reli- 
ability of related algorithms under this fault model, reaching similar conclusions [14]. 
In fact, our crash-only analysis above shows that OMH exhibits slightly greater fault 
tolerance than that assumed in these reliability analyses. 


3.6 Communications Faults 

A disadvantage of most fault models for Interactive Consistency, including the one 
used here, is that they attribute communications failure on a link connecting two 
processors to one or other of the processors concerned. 6 In the draconian fault- 
model of the original OM algorithm (i.e., all processors faults are Byzantine) this 
causes a communications fault on a link — a physical fault that may be considered 
fairly likely, and relatively innocuous in its effects — to be counted as one of the most 
difficult, and hopefully rare, of all faults. 

It is worth inquiring whether the hybrid model used here can be extended to 
treat communications faults as other than arbitrary processor faults. The problem 
in developing a hybrid fault model that includes communications faults as well as 
arbitrary processor faults, is that a communications fault does have the asymmetric 
character of a Byzantine or arbitrary fault. If we introduce communications faults 
as a new fault class, different from arbitrary processor faults, then we must account 

4 That is according to the classical analysis. As noted in the previous section, revised analysis of 
OM(l) shows that it can actually withstand two simultaneous faults, provided at most one of them 
is arbitrary. The chief difference between OM and OMH is that OM does not distinguish manifest 
faults from (other) symmetric ones. 

s Although Algorithm Z is flawed, the analysis in [34] can be correctly applied to OMH 

6 Perry and Toueg [26] presented an interactive-consensus algorithm for a fault model that admits 
communications failures, but that model does not consider Byzantine faults at all. 
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for these faults either in the result corresponding to Lemma 1 above, or in that 
corresponding to Theorem 1. The latter seems the most likely candidate, since 
it is the case that deals with asymmetrically faulty transmitters. However, the 
inductive proof used in Theorem 1 does not work for the case of a transmitter with 
an asymmetric but nonarbitrary communications fault (because in the recursive 
subcases we will still have the same number of arbitrary faults to mask, but one less 
round to do it in — and we cannot mask more arbitrary faults than rounds). The 
alternative is to consider communications faults in Lemma 1. It turns out that this is 
feasible, but is equivalent to regarding a communications fault as a symmetric-fault 
in the receiver. 

This conclusion seems fairly useful, so we record it in the following definition 
and theorem. 

• Let p and q be processors; if there is a communications fault on the link p — * q, 
then a receiver can receive any value (i.e., we allow intermittent and Byzantine 
behavior). 7 

Theorem 3 Let C be the set of links with communications faults. Then Theorem 
1 can be applied provided that for each p — ► q 6 C , either : 

• processor p is counted as arbitrary-faulty (whether it actually is or not), or 

• processor q is manifest- faulty, symmetric- faulty, or arbitrary-faulty and is 
counted as such, or 

• processor q is nonfaulty but is counted as symmetric- faulty. 

Proof: First, we consider the cases where processor q is faulty. In all cases 
(arbitrary-faulty, symmetric-faulty, and manifest-faulty), the behavior of a faulty 
processor is independent of the values it receives; hence the faulty link p — ► q is 
irrelevant. 

If q is nonfaulty, we can attribute the faulty link p — ► q to either p or g. If we 
attribute it to p, then p appears to manifest arbitrary (i.e., Byzantine) behavior, 
and must be counted as arbitrary-faulty. 

If we attribute the link fault p — ► q to q and q is nonfaulty, then the behavior 
seen by other processors is precisely that of a symmetric-faulty processor: if the link 
delivers a wrong value (or the correct one) to q , it will faithfully pass it on to all 
the other receivers; if the link delivers a corrupted (or no) value to q , it will pass on 
R(E). Thus q should be counted as a symmetric-faulty processor. □ 

This and related alternative models of link faults, and other simple but asym- 
metric classes of communication faults, are interesting avenues for further work. 

7 In the treatment used here, there is no advantage in a more restrictive model of communications 
faults. 
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3.7 Implementing R and UnR 

Although the informal and formal specifications suggests that R and UnR are ap- 
plied to all values at every round, this is unnecessary. R and UnR may be identity 
on nonerror values. That is, the following definitions suffice: 

R(x) d = if a: = R 3 E for some j then R 3+X E else x endif 
UnR(x) = f if i = R j+1 E for some j then R 3 E else x endif 
Thus, values v could be passed with an extra (say, highest order) bit denoting 
whether the word actually stands for a data value or for R V (E). R and U nR would 
then become increment and decrement operations conditional on the highest bit. 

If R and UnR are applied to all values at every round, perhaps as uncondi- 
tional increment and decrement operations, then intermediate error values such as 
R(R(E)) may coincide with valid data values. The algorithm remains correct be- 
cause UnR (decrement) is always applied to the output of the majority vote. 

Both of these implementations of R and U nR require unbounded integers in 
order to truly satisfy the requirements on R and UnR (for all v, R(v) ^ E, and 
UnR(R(v)) = v). However, for an m round OMH, just m + 1 error values (E up to 
R m (E)) suffice with suitable modifications to the algorithm. 

One could add a comparison of the number of applications of R with the depth 
of recursion in the algorithm OMH. (Simply computing R X (E) where x is taken 
modulo the total number of rounds leads to erroneous results.) Any values with 
more R's than elapsed rounds may correctly be considered to indicate manifest 
faults and treated as E , thus reducing the number of possible error values to one 
more than the number of rounds. In the common case of one-round OMH, two error 
values, corresponding to E and R(E) suffice. With only a small set of error values, 
it may no longer be necessary to distinguish them by setting a special bit: they 
could simply be allocated to values beyond the valid data range. 

Using these techniques, one may reduce the overhead of using OMH-like algo- 
rithms (as compared to OM) to a small constant number of extra data values, and 
a slightly more complex algorithm. These implementation techniques have not been 
formally verified. 



Chapter 4 


The Formal Specification and 
Verification 


We have formally specified the OMH(n) algorithm and formally verified that it sat- 
isfies the properties of agreement and validity using the PVS verification system [22]. 
The specification language of PVS is a higher-order logic with a very rich type sys- 
tem. This allowed us to specify the OMH algorithm, its assumptions, and properties 
fairly directly. PVS’s theorem prover or proof checker (we use either term, though 
the latter is more correct) is interactive and operates under the direct control of 
user: the user chooses each step that is to be applied and PVS performs it, displays 
the result, and then waits for the next command. PVS differs from most other in- 
teractive theorem provers in the power of its basic steps: these can invoke decision 
procedures for arithmetic, automatic rewriting, induction, and other relatively large 
units of deduction; it differs from other highly automated theorem provers in being 
directly controlled by the user. This style of mechanized proof checking allowed us 
to discover the flaws in our early formulations of the OMH algorithm, and to verify 
the properties of the final version with relatively little effort. 

We describe the formal specification in the next section, and its formal verifica- 
tion in the section after that. 


4.1 Formal Specification 

The formal specification is a single PVS theory omh (shown starting on page 31) 
that takes several parameters, beginning with a natural number m denoting the 
number of rounds of message exchanges to be performed, and a strictly positive 
natural number n that denotes the number of participants (i.e., channels or “fault 
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containment units”). Both the natural numbers (0,1,2,...) and the strictly pos- 
itive natural numbers (1,2,3,...) are predefined types in PVS (not, and posnat, 
respectively), which are specified in the “prelude” of standard definitions that are 
automatically loaded into PVS. The prelude theories are described in the PVS lan- 
guage reference [23], and can also be examined on-line using the PVS commands 
view-prelude-file and view-prelude-theory. 

The parameter list continues by introducing an uninterpreted type T, to repre- 
sent the class of values exchanged in the algorithm, and an uninterpreted constant 
error used to represent values that are recognized as manifestly erroneous. 

The remaining parameters to the OMH theory are the functions R and UnR, 
representing the “wrapping” and “unwrapping” functions that are performed on 
values as they are exchanged on the OMH algorithm. These functions must satisfy 
certain constraints (namely, wrapped values must not look like error values, and 
unwrapping a wrapped value must return the original value) that are stated as 
assumptions on the theory OMH, and discussed in Section 3.2. Formally, both 
are functions from T to T. The function R is used to prevent a value from being 
discarded by the hybrid majority vote. UnR is used to recover the correct value after 
the vote. Recall that error values are recorded as the values “sent” by manifest- 
faulty processors. The first assumption states that no R value is an error. 

R(t) ^ error. 

The second assumption states that UnR of R of a value is the same value. 
UnR(R(t)) = t. 

The algorithm proceeds through a number of “rounds” counted by the natural 
numbers 0,1,..., m; this range of numbers is specified as the type rounds, using 
the predefined type-constructor upto from the PVS prelude. Processors, or “fault 
containment units” are represented by the natural numbers 0, l,...,n- 1. This 
type, called feu, is specified in terms of the predefined type-constructor below from 
the PVS prelude. 1 The type feuset represents sets of feus, and is specified in terms 
of the predefined type-constructor setof , also from the PVS prelude. Finally, the 
type fcuvector is specified as the type of functions from feu s to T. 

Several variables are then introduced, and instantiations of some prelude theories 
are imported. Prelude theories are always available and do not need to be imported 
explicitly; the advantage of doing so, however, is that the required instances can 
be indicated, so that later references can use simple, rather than qualified, names. 

1 A slightly more elegant approach would make feu a type parameter, with an assumption that 
it is bijective with 6e/ow[n]; several prelude theories use this approach. 
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The theory finite. cardinality is one of several cardinality theories available in the 
PVS prelude; these theories differ in their assumptions concerning the type of the 
elements of the sets concerned. The theory finite-cardinality is applicable to sets 
whose elements are drawn from a finite type; its parameters are the type concerned 
(here feu), a natural number that is the cardinality of that type (here n), and a 
bijection (here identity[fcu]) from the canonical set below[n] of cardinality n to the 
type concerned. 

The theory filters defines a function filter that returns the set of members of a 
given set that satisfy a given predicate. Since predicates and sets are equivalent in 
higher-order logic, this operation is the same as set intersection. 2 The theory cardset 
provides some standard lemmas concerning cardinality, and filters (for example, the 
cardinality of a set is nonzero if and only if the set is nonempty); it takes the same 
arguments as finite-cardinality . The imported theory hybridmjrty is not essential to 
the main development and is described in Appendix A. 

The type statuses is defined to be an enumeration of exactly four constants, 
corresponding to the four categories of behavior: arbitrary , symmetric , manifest , 
and good. 3 The function status returns the status of a given processor (or fcu)\ this 
implicitly enforces our notion that a processor not change status during execution 
of the agreement protocol. A processor that, in reality, is symmetric-faulty one 
moment and manifest-faulty the next must be modeled as one that is arbitrary- 
faulty throughout the computation. 

Some shorthands are then defined for describing statuses: a, s, c, and g are 
predicates recognizing the arbitrary-faulty, symmetric-faulty, manifest-faulty, and 
good processors, respectively. Similarly, given a set caucus, as(caucus) is the set of 
arbitrary-faulty processors in caucus. The functions ss, cs and gs similarly select 
the symmetric-faulty, manifest-faulty, and good processors, respectively. A simple 
lemma, fincard-all , states that the cardinality of a set of processors is equal to the 
sum of the cardinalities of the subsets of its processors of each status. This lemma 
follows from a property implicit in the definition of statuses as an enumeration type: 
the members of the enumeration are inclusive and disjoint. 

The function send captures the properties of sending values from one processor 
to another. This function takes a value to be sent, a sender, and a receiver as 
arguments; it returns the value that would be received if the receiver were a good 
processor. The result actually received is irrelevant if the receiver is not a good 
processor (because the values passed on by faulty receivers are not assumed to be 
related to those received). We axiomatize the behavior of send according to the 

2 The theory filters also provides a similar function on lists, which is rather more complex. 

Enumeration constants are also overloaded as recognizer functions in PVS. Thus, if s is a 
variable of type statuses, s = arbitrary and arbitrary(s) are equivalent formulas. The latter form 
is used in this specification. 
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status of the sender. The first axiom simply says that a good processor sends 
correct values to all (good) receivers: 

g(p) D send(t,p,q) = t. 

Note that here, and in further formal definitions, free variables are universally bound 
at the outermost level, and the types of all variables are omitted for brevity. See the 
complete specification for subsidiary and variable declarations. The second axiom 
says that a manifest-faulty processor always delivers values that are recognized as 
erroneous by good receivers: 

c(p) D send(t,p,q) = error. 

The third axiom says that a symmetric-faulty processor sends the same value to all 
good receivers, although that value is otherwise unconstrained (i.e., it may be any 
possible value, including those that are recognized as erroneous) 

s(p) D send{t,p,q) = send(t,p,z). 

Nothing is specified for the behavior of arbitrary-faulty senders. A lemma (called 
send5 ) is stated and proved that all good receivers obtain the same value when the 
sender has any status but arbitrary-faulty: 

->a(p) D send(t,p, q) = send(t,p,z). 

A deficiency of this specification is that, because send is a function, even ar- 
bitrarily faulty processors are consistent from one round to the next: the value 
send{t,p, q) is some fixed value, suggesting that a faulty processor p, given the same 
value t, will always send the same (possibly bad) value to the processor q — even in 
different rounds of the protocol. This fact is not exploited in the proof, but it is not 
self-evident that this is so. In our verification of the OM algorithm [27], we added 
the round number as an additional argument to send in order to lessen this concern. 
However, the only way to allay such doubts absolutely is to specify send as a rela- 
tion. Our colleague Shankar has axiomatized the OM algorithm using a relational 
send, and has proven the corresponding correctness conditions. Unfortunately, the 
relational send complicates and obscures the specification (since it forces other func- 
tions to become relations also), so we have chosen to retain a functional send for 
this exercise. It is probable that a relational version could be created without great 
effort. 

Our formal specification of OMH is based on our earlier specification of the 
classical OM algorithm [27]. Rather than simply present the formal specification 
of OMH as a fait accompli, we first reproduce some of the development of the 
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specification for the OM algorithm from our earlier report, and then transform it 
into the OMH algorithm. 

We start by considering the Interactive-Consistency version of OM, which we 
call OMIC. We specify OMIC as a function of three arguments: m the number 
of rounds, v an fcuvector giving the private values of each processor, and caucus 
the set of processors participating in (this round of) the algorithm. OMIC will 
return a “vector” of fcuvectors : that is a function from feu to fcuvector. Thus 
OMIC(m, v, caucus)(p) will be the fcuvector of processor p following the OMIC 
algorithm, and OMIC(m, v, caucus)(p)(q) will be p’s opinion of q's private value. 
Notice that we are using higher-order functions here (i.e., functions whose values 
are functions). 

In preparation for formally specifying OMIC, we first state its behavior for the 
case m = 0. 


OMIC(0, f, caucus )(p)(q) = send(v(q) , q , p) 


Our requirement on OMIC in the case m = 0 simply states that p’s opinion of q's 
private value v(q) following the algorithm should be send(v(q), q,p). It might seem 
that we should require that both p and q should be members of the set caucus (as 
we did in the specification in [27]), but this is unnecessary because the value of the 
function is irrelevant p or q are not members of caucus . 

For the case m = r, r > 0, we require that p’s opinion of <^s private value 
should be send(v(q), q, q) if p = q , 4 otherwise it should be the majority value in p’s 
fcuvector , after performing OMIC with m — r - 1 on the current set of processors 
with q excluded, and the values received from q as the private values. Now the value 
received by an arbitrary processors 2 from q is send(v(q), <7, 2), so the fcuvector of 
such values is 

(Xz : send(t }{q),q,z)). 

The inner round of OMIC is therefore described by 

OMIC(r — 1 , (Xz : send(v(q), q , 2)), caucus - {<7}), 
and the fcuvector received by p following this is 

OMIC(r — 1, (A2 : send(v(g), <7, 2)), caucus — {?})(?)* 

4 We could specify v(g) in this case; we have chosen the weaker assumption that a faulty processor 
may not even know its own value. 






26 


Chapter 4. The Formal Specification and Verification 


Thus the required specification is: 



The function Majority takes a set of processors (here caucus-{q}), and an fcuvector , 
and computes the majority value (if any) in that vector over that set. 5 

The two behaviors stated above (for the cases m = 0, and m > 0, respectively) 
could be specified as axioms defining the function OMIC; we prefer, however, to spec- 
ify the function definitionally and to deduce those properties as (straightforward) 
lemmas. The advantage of the definitional specification is that the PVS typechecker 
will guarantee its soundness (in the sense of not introducing inconsistencies). To do 
this, we are required to exhibit a measure function that takes the same arguments as 
OMIC and whose value is a natural number that can be proved to decrease across 
recursive calls. In the present case, we use the function that returns the round 
number as the measure function. The final specification for OMIC is given below. 



5 Requiring this function to be implemented by a majority vote overspecifies the problem. All 
that is really required is that if the good processors form a majority in caucus, and if all the good 
processors have the same value in the vector, then that is the value of the Majority function. Taking 
the median of the values of the members of caucus (assuming they come from an ordered set) would 
also satisfy this specification (as was correctly noted by Lamport, Shostak and Pease [17, page 388]). 
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The next step is to convert our specification of the basic Oral Messages algo- 
rithm from the Interactive- Consistency to the Byzantine Generals formulation. We 
specify the Byzantine Generals form by a function OMBG that is similar to OMIC, 
but takes an additional (first) argument giving the identity of the Transmitter (or 
Commanding General) , and replaces the fcuvector of private values by a single pri- 
vate value (that of the Transmitter). The result returned by OMBG is a simple 
fcuvector : 

OMBG(G, m, t, caucus)(p) 

is processor p’s opinion of the Transmitter G’s private value t following an m-round 
exchange. If we assume that OMIC is available, then the behavior required of 
OMBG in the case r > 0 can be derived directly from that of OMIC: 


r > 0 

D OMBG(G, r, t, caucus)(p) 

= IF p = G then send(*, G,p) 

ELSE 

Majority (caucus — {<?}, 

OMIC(r — l,(Az: send(C G, z)), caucus — {g})(p)) 

ENDIF 


The next step is to replace the inner call to OMIC by one to OMBG. Now 

OMIC(r - 1, (Az : send(2, G, z)), caucus - { q})(p ) 

is an fcuvector giving processor p’s opinion of the values received by each processor 
when G sends them t. Using OMBG, p’s opinion of the value received by processor 
z in this circumstance can be written 

OMBG(z,r - 1, send(2,G, z), caucus — {^})(p) 

(i.e., z takes the part of the Transmitter, distributing the value send(t y G, z) received 
from the “real” Transmitter). Thus, the required fcuvector giving p’s opinion of the 
values received by all such processors z is given by: 

(A z : OMBG(z, r — 1, send (t, G, z)> caucus - (?})(p)). 

In this way we arrive at the specification for OMBG shown below. 
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OMBG(G,r, t, caucus) : RECURSIVE fcuvector = 

IF r = 0 THEN (A p : send (t,G,p)) 

ELSE 

(A P ■ 

IF p = G THEN send(t, G,p) 

ELSE 

Majority(caucus — {G}, 

(A z : OMBG(z, r - 1, send(t, G, z), caucus - {G})(p))) 

endif) 

ENDIF 

measure (A G,r,t, caucus — ♦ nat : r) 

The final step is to transform this specification into one for OMH. The only dif- 
ferences between OMBG and OMH are that the latter uses HybridMajority instead 
of the simple Majority function, and “wraps” and “unwraps” the values sent and 
received in the recursive calls with the functions R and UnR, respectively. Thus the 
specification for OMH given below is easily derived. 


OMH(G, r,t, caucus) : recursive fcuvector = 

IF r = 0 then (A p : send(t, G,p)) 

ELSE 

(A p : 

if p = G then send(t,G,p) 

ELSE 

UnR(HybridMajority(caucus - {G}, 

(A z : OMH(z, r - 1, R(send(< ,G,z)), caucus - {G})(p)))) 

endif) 

endif 

MEASURE (A G,r,t, caucus-* nat: r) 


It remains to specify the properties required of the functions HybridMajority , 
UnR, and R. The function HybridMajority is intended to be like the previous (stan- 
dard) Majority function, except that all error values are excluded. Recall from our 
earlier discussion that the Majority function does not actually need to be a majority 
vote, so it is preferable to specify the properties required of it axiomatically. The 
same is true of HybridMajority, which is axiomatized below. Although these two 
properties are all that is required of an implementation of HybridMajority, we have 
provided a concrete implementation of HybridMajority based on the Boyer- Moore 
MJRTY algorithm [3], and proved that the axioms below are satisfied by this im- 
plementation. Thus the following may be considered axioms, or may be considered 
lemmas proven by appeal to the hybridmjrty theory described in Appendix A. 
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|gs(caucus)| > |as(caucus)| + |ss(caucus)| 

A (V p : g(p) A p G caucus D u(p) = t) 

A t ^ error A (V p : c(p) A p G caucus D u(p) = error) 
D HybridMajority(caucus, t;) = t 


The antecedent to the implication in this specification is complicated, but can 
be read as follows. The function HybridMajority takes two arguments, a set of pro- 
cessors (i.e., an fcuset ), which we call the caucus, and a vector mapping processors 
to values (i.e., an fcuvector). If the vector records the same value for all good pro- 
cessors in the caucus, and the vector records an error value for all manifest-faulty 
processors in the caucus, and there are more good processors in the caucus than 
the sum of arbitrary-faulty and symmetric-faulty processors in the caucus, then Hy- 
bridMajority returns the same value as that recorded in the vector for the good 
processors. Any implementation of HybridMajority that does in fact compute the 
true majority after casting out error values would satisfy this axiom. 

The second axiom states that the value returned depends only on the values 
recorded in the vector for the processors in the caucus. Although HybridMajority 
is a function, it could potentially be implemented in such a way that when there is 
no majority (i.e., when the antecedent to the implication above is false), the output 
depends on values of the vector corresponding to processors not in the caucus, or 
other irrelevant information contained in the arguments. The second axiom prohibits 
this kind of behavior. 


(V p : p G caucus D V\(p) = x>i (p)) 

D HybridMajority (caucus, ui) = HybridMajority(caucus, t> 2 ) 


The remainder of the specification consists of interesting properties of the OMH 
algorithm. Many of the following theorems are first defined as predicates, then a 
lemma asserting that this predicate is universal is proved by induction, and then a 
theorem giving the result in the form desired is derived from the lemma. This style of 
breaking a specification into a predicate and a separate lemma and theorem is quite 
useful in formal systems. Many other large specifications use this technique [30,37]. 

Note that in the semiformal specification there was a notion of the value actually 
sent by the transmitter. This is very close to the value of the function send(t,p,q ), 
although we have axiomatized send so that manifest-faulty processors “send” error. 
Thus the formal specification of Validity uses send where the semiformal specifica- 
tion uses case analysis and the notion of the value actually sent by symmetric-faulty 
processors. 
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The first big property is Validity , stating that if the transmitter is not arbitrary- 
faulty, then this algorithm achieves the same result as send . This captures the 
correct behavior when the transmitter is good (send delivers the correct value), 
symmetric-faulty (send delivers the same wrong value to all receivers, which then 
agree on this value), and manifest-faulty (send effectively delivers the value (error 
to all receivers which then agree on error as the transmitter’s value). 


a(q) 

A p £ caucus 
A q £ caucus 

A |caucus| > 2 x (|as(caucus)| + |ss(caucus)|) -f |cs(caucus)| + r 
D OMH(g, r, t, caucus)(p) = send(i,^,p) 


The next property is Agreement, which states that if two receivers are both good 
they will agree, whatever the status of the transmitter. 


9{P) 

a g(q) 

A p £ caucus 
A q £ caucus 
A z £ caucus 

A |caucus| > 2 x ()as(caucus)| + |ss(caucus)|) *f |cs(caucus)| + r 
A r > |as(caucus)| 

D OMH(z, r, caucus)(p) = OMH( 2 , r, t, caucus)(g)) 


The next property, Validity. final, instantiates the inductive validity property 
with the full set of processors. The property Validity-Corollary addresses the special 
case when the transmitter is good. In this case the correct value is agreed upon by all 
good receivers. The property Agreement. final instantiates the inductive agreement 
property with the full set of processors. 

The remainder of the specification addresses the special case when there are no 
symmetric-faulty nor arbitrary-faulty processors. In this case a somewhat better 
bound can be achieved with regard to manifest-faulty processors. In particular, the 
bounds given by the general versions of the theorems are not as good as can be 
achieved with simple non-Byzantine resilient algorithms. However, as described in 
Section 3.4, OMH does actually achieve optimal behavior in these cases, as shown 
by the alternative analysis described below. 

The same set of interesting properties described above is then repeated with the 
added assumption that there are no arbitrary-faulty nor symmetric-faulty proces- 
sors. The bounds proven in these cases deliver more resilience to manifest faults. 
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Note that in these analyses, there is at least one good receiver, and the remainder of 
the receivers are either good or manifest-faulty, so the good receivers will always win 
the majority vote. A detail is the extra requirement that there be more processors 
than rounds, since it is difficult to assert properties of the OMH algorithm run on 
the empty set of processors. 


Note that analogous “ArbitraryOnly” theorems also hold, giving optimal 
bounds, 6 although these bounds are trivial consequences of the general theorem. 
The analogous “SymmetricOnly” theorems would not give optimal bounds, as was 
discussed in Section 3.4. 


omh[m : nat,n: posnat,T: TYPE, error : T, Ft, UnR : [T — ► T]] : 
THEORY 
BEGIN 

ASSUMING 

act_ax : ASSUMPTION (V (t : T) : R(*) ^ error) 
unact-ax : assumption (V (t : T) : UnR(R(f)) = <) 

ENDASSUMING 

rounds : type = upto[m] 
t : VAR T 

feu : TYPE = belowfn] 
feuset : TYPE = setof[fcu] 
fcuvector : TYPE = [feu — ► T\ 

G,p,q y z : var feu 
t/,vi,t/2 ; V AR fcuvector 
caucus : VAR feuset 
r : VAR rounds 
IMPORTING 

finite_cardinality[fcu, n, identity [feu]], 
filters [feu], 

card jset [feu, n, identity [feu]], 
hybridmjrty[T, n, error] 


6 Pease, Shostak and Lamport [25] proved that at least 3a 4-1 processors are required to withstand 
a arbitrary faults. This result has been formally verified by Bevier and Young [2, 1] using the Boyer- 
Moore prover. 
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statuses: TYPE = {arbitrary, symmetric, manifest, good} 

status : [feu — * statuses] 

a(z) : bool = arbitrary(status(z)) 

s(z) : bool = symmetric(status(z)) 

c(z) : bool = manifest(status(z)) 

g(z) : bool = good(status(z)) 

as(caucus) : feuset = filter(caucus, a) 

ss(caucus) : feuset = filter(caucus, s) 

cs(caucus) : feuset = filter(caucus, c) 

gs(caucus) : feuset = filter(caucus, p) 

fincard_all : 

LEMMA 

| caucus] = |as(caucus)| + |ss(caucus)| 4- |cs(caucus)| + |gs(caucus)| 

send : [T, feu, feu — T] 

sendl : AXIOM g(p) 3 send (t,p,q) = t 

send2 : axiom c(p) 3 send (<,p,g) = error 

send4 : AXIOM s(p) 3 sen d(t,p,g) = send(t,p,z) 

send5 : LEMMA -» a(p) 3 send (t,p,q) = send(t, p,z) 

HybridMajority (caucus, t>) : T = proj -l(Hybrid_mjrty (caucus, v } n)) 

HybridMajority_axl : 

LEMMA 

|gs(caucus)| > |as(caucus)| + |ss(caucus)| 

A (V p : g(p) A p G caucus 3 v(p) = t) 

A t ± error A (V p : c(p) A pG caucus 3 u(p) = error) 

3 HybridMajority(caucus, v) = t 

HybridMajority _ax2 : 

LEMMA 

(V p : p G caucus 3 v\ (p) = u 2 (p)) 

3 HybridMajority(caucus, v x ) = HybridMajority(caucus, v 2 ) 
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OMH(G, r, caucus) : RECURSIVE fcuvector = 

IF r = 0 THEN (A p: sen d(<,G,p)) 

ELSE 

(A p : 

IF p = G THEN send (t } G,p) 

ELSE 

UnR(HybridMajority(caucus — {G}, 

(A g : OMH(^,r — 1, R(send(J, G, </)), caucus 

endif) 

ENDIF 

measure (A G, r, t, caucus — ► nat : r) 

Validity _Prop(r) : 
bool 

(V p, q , caucus, 2 : 

a(g) 

A pE caucus 
A q £ caucus 
A |caucus| 

> 2 x (|as(caucus)| + |ss(caucus)|) + |cs(caucus)| + r 
3 OMH(g, r,/, caucus)(p) = send(<, <j,p)) 

Validity : lemma Validity _Prop(r) 

Agreement_Prop(r) : 
bool 

(V p, q t z, caucus, t : 

(s(p) 

A </(?) 

A p 6 caucus 
A <7 £ caucus 
A z £ caucus 
A |caucus| 

> 2 x (|as(caucus)| + |ss(caucus)|) + |cs(caucus)| 

+ r 

A r > |as(caucus)|) 

D OMH(z, r, t, caucus)(p) = OMH(z, caucus)^)) 

Agreement : LEMMA Agreement.Prop(r) 


{G})(P)))) 
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Validity -Final : 

THEOREM 

g(p) 

A -> a(G) 

A |a| < m 

A 2 x |a| + 2 x \s\ + |c| + m < n 
3 OMH(G, m, t, fullset[fcu])(p) = send (t,G,p) 

Validity .Corollary : 

THEOREM 

g{p) 

A g(G) 

A |a| < m 

A 2 x |a| + 2 x |s| + |c| + m < n 
3 OMH(G,m,i,fullset[fcu])(p) = t 


Agreement-Final : 

THEOREM 

g(p) 

A $(<?) 

A |a| < m 

A 2 x |a| + 2 x |s| + |c| + m < n 

3 OMH(G, m, i , fullset[fcu])(p) = OMH(G, m, <, fullset[fcu])(g) 

Crash-Only-Validity -Prop(r) : 
bool 


(V p, q, caucus, t : 

g(p) 


A p E caucus 
A q E caucus 
A |as(caucus)| = 0 A 
D OMH(<y } r, t , caucus)(p) = send(t ( g,p)) 


|ss(caucus)| = 0 A |caucus| 


> r 


Crash-Only -Validity : LEMMA Crash_Only_Validity-Prop(r) 
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Crash-Only _Agreement_Prop(r) : 
bool 

(V p, q, z, caucus, t : 

9(P) 

^ 9{q) 

A p £ caucus 
A q £ caucus 
A z £ caucus 

A |as(caucus)| = 0 A |ss(caucus)| = 0 A |caucus| > r 
D OMH(z, r, i, caucus)(p) = OMH(z,r,<, caucus)^)) 

Crash-Only -Agreement : lemma Crash-Only _Agreement_Prop(r) 

Crash.Only_Validity .Final : 

THEOREM 

g(p) A |a| = 0 A |s| = 0 A |fullset[fcu] | > m 
D OMH(G, m, t, fullset[fcu])(p) = send(<,G,p) 

Crash_Only_Validity -Corollary : 

THEOREM 

9(p) A g{G) A |a| = 0 A \s\ = 0 A |fullset[fcu]| > m 
D OMH(G, m,<,fullset[fcu])(p) = t 

Crash-Only-Agreement-Final : 

THEOREM 

9(P) A g(q) A |a| = 0 A \s\ = 0 A |fullset[fcu]| > m 

D OMH(G, m, t, fullset[fcu])(p) = OMH(G 1 m,< ) fuIlset[fcu])( 9 ) 

END omh 


4.2 Formal Verification 


The formal verifications corresponding Lemma 1 and Theorems 1 and 2 are proved 
by induction on the number of rounds, and follow the informal proofs quite closely. 
The theorem prover of PVS with its built-in arithmetic decision procedures and 
rewriting allowed the formal proof to be constructed at a relatively high level with- 
out being mired in detail. The PVS system allows partial proofs to be replayed under 
alternative assumptions, facilitating the exploration of generalizations and special 
cases, such as that reported in Theorem 2, formally reflected in the Crash-Only vari- 
ants of the theorems. Another example of this sort of exploration was the removal 
of the assumption that error values are disjoint from good data values. The proof 
of each lemma in the specification is described abstractly below. 
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The first lemma, fincard.all, states that the cardinality of an entire set of proces- 
sors is equal to the sum of the cardinalities of the processors in that set of each status. 
This lemma follows from properties implicit in the definition of statuses: that they 
are inclusive and disjoint. In detail, the formal proof requires 30 user-supplied steps 
in PVS, seven of which are ground or assert, which invoke the ground decision 

procedures of PVS. 

The second lemma, send5 , states that all non- arbitrary-faulty processors exhibit 
symmetric sending behavior. Informally, the proof of this property appeals to the 
fact that the four statuses— arbitrary, symmetric, manifest, and good are inclusive. 
Case analysis and appeal to the send axioms sendl , send2, and send4 essentially 
completes the proof. In two cases, such as that when the transnutter is manifest- 
faulty, the relevant axiom must be applied twice. The entire formal proof comprises 
14 user-supplied steps in the PVS interactive verification system. 

The most complicated proof constructed for this specification is for Lemma 1 
(called Validity in the formal specification), stating that if the transmitter is not 
arbitrary-faulty then all good receivers end up with the value actually sent by the 
transmitter. This proof contains 13 invocations of lemmas and axioms, most of them 
basic lemmas from the prelude and axioms from the OMH theory. The lemmas and 
axioms cited in the proof are: induction ; statuses.inclusive , and statuses.disjoint , 
which are the automatically-generated assertions that the set of four statuses are 
inclusive and disjoint; fincard .remove, a lemma giving the cardinality of a set after 
an element has been removed (used often-thirteen times); fincard.filter , a lemma 
asserting that the cardinality of a set is not less than the cardinality of that set 
with some elements removed; fincard.all, described above; send5, described above; 
remove-comm, a lemma asserting that the order in which elements are removed from 
a set is immaterial, used four times; and axioms (such as the definitional axioms 
for send) brought into the proof explicitly six times. The entire proof consists of 80 
user-suggested steps, 15 of which are PVS assert commands, which invoke rather 
powerful decision procedures for ground arithmetic [31,32]. After some experimen- 
tation with alternative specifications, including constructing partial failed V TOO J s ° 
this lemma for alternative versions of the algorithm OMH, the first proof of Validity 
was constructed from scratch in a few hours. 

The proof of the crash-only variant of Validity is very similar in nature to the 
general version of validity, although it contains one fewer user-supplied steps, 14 of 
which are assert. 

The proof of the Agreement property contains 10 invocations of lemmas and 
axioms, and consists of 73 steps, including 13 calls to the ground decision procedures. 
This proof was constructed from scratch in a few hours. 

Table 4.1 summarizes some gross measures of the size and difficulty of construct- 
ing proofs for the lemmas and theorems of this specification. The first column is 
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formula name 

user-supplied 

steps 

number of 
inductions 

uses of 
assert 

fincard.all 

30 

i 

7 

sends 

14 

0 

1 

Validity 

80 

1 

15 

Agreement 

73 

1 

13 

Validity -final 

36 

0 

4 

Validity.Cor 

14 

0 

2 

Agreement .final 

51 

0 

4 

Crash-Only _ Validity 

79 

1 

14 

Crash-Only -Agreement 

41 

1 

6 

Crash-Only _ Valdity -final 

29 

0 

0 

Crash-Only _ Validity.Cor 

8 

0 

1 

Crash-Only-Agreement-final 

21 

0 

0 

Hybridmajority_axl 

78 

2 

16 

Hybridmajority-ax2 

20 

1 

3 


Table 4.1: Statistics for the Proofs Performed 

the name of the formula concerned. The second column is the total number of 
user-suggested proof steps in the final proof. The third column counts the uses of 
induction. The fourth column counts the uses of ground or assert, which invoke the 
ground decision procedures; these roughly correspond to the number of significant 
branches in a proof. 

The critical measure, however, for specification and verification tasks is not the 
size of individual proofs, but the total time taken from problem understanding 
through complete formal proof. The effort reported here took less than a month 
of part time work, including the exploration of flawed modifications to Algorithm 
Z that seemed informally plausible, and a change in notation for expository pur- 
poses. Producing this report took far more time than the formal specification and 
verification combined. 

Full machine-readable PVS specifications and PVS proofs of the entire proof 
chain are available from the authors. 
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4.2.1 Portion of PVS Proof of Validity 

In order to give an idea of the formal proof as interactively developed using PVS, we 
reproduce a prettyprinted version of such a proof (slightly edited for readability). 
It begins with the theorem name, Validity, and the initial sequent. Sequents 
are presented as a list of numbered hypotheses, a horizontal line, and a list of 
numbered conclusions. One may read a sequent as stating that the conjunction 
of the hypotheses implies the disjunction of the conclusions. Initially, there are 
no hypotheses, and only one conclusion, stating that for any number of rounds, 
the inductive validity property holds for that many rounds. The first step in the 
proof is the application of induction on the number of rounds. This leads to two 
subgoals, called Validity. 1 and Validity. 2. For Validity. 1, the base case of 
the induction, the definition of the inductive validity property is expanded and the 
result is skolemized (fresh constants are introduced in place of universally quantified 
variables) Then the definition of OMH is expanded and reduced. (Recall the 
specification given earlier; zero-round OMH reduces to send.) This completes the 
branch of the proof corresponding to the base case of the induction. 

The remaining branch, called Validity. 2, is then proved. This branch requires 
it to be shown that for any number of rounds r, if the inductive validity property 
holds for r rounds, then it also holds for r + 1 rounds. By skolemizing, expanding 
definitions, and applying propositional simplification, we arrive at the crux of the 
proof Here we must show that if OMH behaves correctly at t- rounds, then the 
UnR of the HybridMajority of the result of all other receivers utilizing OMH to 
broadcast R of the value they received from the transmitter is the same as the value 
actually sent by the transmitter. This must be demonstrated under certain other 
assumptions, such as that are enough nonfaulty processors, and that the transmitter 
is not arbitrary faulty. The proof proceeds by utilizing a property of HybndMajor- 
ity, called HybridMajority.axl . There are four hypotheses of this property, and one 
conclusion. After quantifying appropriately, the proof is split into five cases, corre- 
sponding to a proof of each hypothesis of HybridMajority. ax 1 and a proof from the 
conclusion of HybridMajority.axl to the conclusion of the actual property of inter- 
est. The proof of the latter (Validity .2. 1) proceeds by bringing in the assumption 
unact. ax, hiding some irrelevant formulas, and invoking the decision procedures. 
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Validity: 


{1} (V (r : rounds): Validity _Prop(r)) 

Inducting on r yields 2 subgoals: Validity. 1 and Validity. 2 
Validity. 1: 


{1} Validity _Prop(0) 

Expanding the definition of Validity .Prop 
Validity . 1 : 


{1} (V ( p,q : feu), (caucus: fcuset),(f: T) : 

-« arbitrary(status(#)) 

A caucus(p) 

A caucus(g) 

A |caucus| 

> 2 x |filter(caucus, a)| + 2 x |filter(caucus, s)| 

+ |filter(caucus, c)| 

+ 0 

D OMH(tf,0,*,caucus)(p) = send(<,g,p)) 

For the top quantifier in 1, we introduce Skolem constants: (p\ q', caucus', t f ) and 
apply disjunctive simplification to flatten the sequent, 

Validity . 1 : 

{ — 1} caucus' (p') 

{ — 2} caucus'^') 

{—3} |caucus'| 

> 2 x |filter(caucus\a)| + 2 x |filter(caucus', s)\ 

+ |filter(caucus', c)| 

+ 0 


{1} arbitrary(status(<7')) 

{2} OMH(g', 0,*', caucus')(p') = send(/', q\p f ) 

Expanding the definition of OMH completes the proof of Validity . 1. 
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Validity . 2 : 


{1} (V (r : upto[m]) : r < m A Validity _Prop(r) D Validity _Prop(r + 1)) 

For the top quantifier in 1, we introduce Skolem constants: ( r ') and apply disjunctive 
simplification to flatten the sequent, 

Validity. 2: 

{-1} r' < m 

{-2} Validity -Prop(r') 


{1} Validity _Prop(r' + 1) 

Expanding the definition of Validity .Prop in formula 1 
Validity .2: 

{-1} r' < m 

{-2} Validity -Prop(r') 


{1} (V (p, q ■■ feu), (caucus : fcu9et),(t: T) : 

-i arbitrary(status(g)) 

A caucus(p) 

A caucus(q) 

A | caucus | 

> 

2 x |filter(caucus, a)| + 2 x |filter(caucus, s)| 
+ |filter(caucus,c)| 

+ r' 

+ 1 

3 OMH(?,r' + l.t, caucus) (p) = send(t,g,p)) 
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For the top quantifier in 1, we introduce Skolem constants: (p', q', caucus', t f ) and 
apply disjunctive simplification to flatten the sequent, 

Validity. 2: 

{ — 1} r f < m 

{ — 2} Validity _Prop(r') 

{—3} caucus' (p l ) 

{—4} caucus'(g') 

{ — 5} |caucus'| 

> 

2 x |filter(caucus', a)\ + 2 x |filter(caucus / , s)\ 

+ |filter(caucus', c)\ 

+ r' 

Hb 1 


{1} arbitrary (status(y)) 

{2} OMH(g',r' + 1, caucus / )(p / ) = send (t',q\p') 
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Expanding the definition of OMH 

Validity .2: 

{ — 1} r' < m 

{-2} Validity .Prop(r') 

{—3} caucus'(p') 

{—4} caucus'(g') 

{—5} |caucus'| 

> 

2 x |filter(caucus',a)| + 2 x |filter(caucus', s)| 
+ |filter(caucus', c)| 

+ r' 

+ 1 


{1} arbitrary(status(</')) 

{2} IF p' = q' THEN send(<',g',p') 

ELSE 

UnR(HybridMajority(caucus' - {?'}, 

(A ( q : feu) : OMH(?, r', R{aend(t', q' , q)), caucus' - {?'})(?')))) 

ENDIF 

= send(t', p') 
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Lifting IF-conditions to the top level, 
and by propositional simplification, 

Validity. 2: 

{ — 1 } r' < m 

{ — 2} Validity _Prop(r') 

{—3} caucus'(p') 

{—4} caucus'(g') 

{ — 5} | caucus' | 

> 

2 x |filter(caucus', a)| + 2 x |filter(caucus', s)| 
+ |filter(caucus', c)| 

+ r' 

+ 1 


{1} P' = q' 

{2} UnR(HybridMajority(caucus' — {y'}, 

(A ( q : feu) : OMH( 9 , r', R(send(t', q', q)), caucus' - {g'})(p'))» 

= send(t',q',p') 

{3} arbitrary(status(g')) 
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Applying HybridMajority-axl 
Validity. 2: 

{-1} (V (caucus: fcuset),(t : T),( v: fcuvector) : 

|gs(caucus)| > |as(caucus)| + |ss(caucus)| 

A (V (p : feu) : p(p) A p G caucus D v(p ) = t) 

A t ± error A (V (p : feu) : c(p) A p G caucus D v(p) = error) 
3 HybridMajority(caucus,v) = t) 

{-2} r' < m 

{—3} Validity .Prop(r') 

{—4} caucus' (p') 

{—5} caucus'(g') 

{—6} | caucus' | 

2 x |filter(caucus',a)| + 2 x |filter(caucus',s)| 

+ |filter(caucus', c)| 

+ r' 

+ 1 


{ 1 } 

{ 2 } 


p' = q' 


UnR(HybridMajority(caucus' — {<?'}, , 

(A ( q : feu) : OMH( 9 , r', R(send(t' , q' , q)), caucus - {? })(p 

= send (t',q',p') 


{3} arbitrary(status(g')) 


Instantiating the top quantifier in -1 with the terms, 
remove^', caucus') 

R(send (t’,q',p')) 

(A (q : feu) : OMH( ? ,r', R(send(t', q', q)), remove^', caucus ))(p)) 
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Validity. 2: 

{-1} |gs(caucus' - {g'})| 

> |as(caucus' - {<?'})[ 4- |ss(caucus' — {g'})| 

A 

(V ( p : feu) : 

g(p) A p E caucus' - {g'} 

D 

(A (g : feu) : 

OMH(g, r', R( send(*', g', q)) } caucus' - {?'})(p'))(p) 
= fl(send(<',gV))) 

A /£(send(t', g',p')) / error 
A 

(V (p : feu) : 

c(p) A p G caucus' — {g'} 

D 

(A (g : feu) : 

OMH (g, r', iZ(send(^', q\ q )), 

caucus' - {g'})(p'))(p) 

= error) 

D 

HybridMajority(caucus' - {g'}, 

(A (q : feu) : 

OMH(g, r', i?(send(/', g', q)) } 
caucus' - {g'})(p'))j 

= i?(send(<', g',p')) 

{ — 2} r' < m 

{ — 3} Validity _Prop(r') 

{—4} caucus' (p') 

{ — 5} caucus' (g') 

{—6} [caucus' | 

> 

2 x |filter(caucus', a)| + 2 x |filter(caucus', s)\ 

+ |filter(caucus', c)| 

+ r' 

+ 1 


{i} p' = </' 

{2} UnR(HybridMajority(caucus' — {?'}, 

(A ( q : feu) : OMH(g, r', ^send^', q' , q)), caucus' - {?'})(?')))) 


= send(f / , q',p') 
{3} arbitrary(status(g')) 
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Splitting conjunctions yields 5 subgoals: 

Validity .2. 1 : 

{-1} HybridMajority (caucus' - {?'}, 

(A (q : feu) : 

OMHfa,r',.R(send(t', q',q)), 
caucus' - {g'})(p'))) 

= ft(send(<',g',p')) 

{-2} r' < m 

{—3} Validity _Prop(r') 

{—4} caucus' (p') 

{ — 5} caucus'(g') 

{—6} |caucus'| 

> 

2 x |filter(caucus',a)| + 2 x |filter(caucus',s)| 
+ |filter(caucus' , c)| 

+ r' 

+ 1 


{1} P' = ?' 

{2} UnR(HybridMajority(caucus' - {?'}, 

(A ( q : feu) : OMH(g, r', fi(send(t', q', q)), caucus' - {?'})(?')))) 

= send p') 

{3} arbitrary(status(g')) 
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Applying unact_ax and instantiating the top quantifier with the term: send(£', g',p') 
and hiding some formulas, 

Validity .2. 1 : 

{ — 1} UnR(i£(send(<', q',p f ))) = send 

{—2} HybridMajority(caucus' — {g'}, 

(A (q : feu) : 

OMH(?y,fl(send(/', ?',$)), 
caucus' - {q’})(p'))) 

= #(send(<', q' ,p f )) 


{1} UnR(HybridMajority(caucus / — {$'}, 

(A (q : feu) : OMH (q, r', R(send(t' , q\ q)) t caucus' - {?'})(p')))) 

= send (t f ) q\p f ) 

Invoking decision procedures completes the proof of Validity .2 . 1. 


The remainder of the proof takes up over one hundred pages of printed text, and 
is omitted here. Full machine-readable PVS specifications and PVS proofs of the 
entire proof chain are available from the authors. 


4.3 PVS Proof Chain Analysis 

Here we reproduce a summary of the PVS analysis of the entire chain of proof for 
the verification conducted. Following the summary is a detailed description of all 
definitions, axioms, assumptions, lemmas, and theorems used implicitly or explicitly 
in three example proofs. 


Proof summary for theory omh 

IMPORTING 1_TCC1 

f incard_all 

sendS 

Hybr idMa j ority.TCCl 

HybridMajority_axl 

HybridMaj or ity_ax2 

OMH.TCCl 

OMH JTCC2 


proved - complete 
proved - complete 
proved - complete 
proved - complete 
proved - complete 
proved - complete 
proved - complete 
proved - complete 
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Val idity _Prop_TCC 1 proved - complete 

Validity proved - complete 

Agreement _Prop_TCCl proved - complete 

Agreement proved - complete 

Validity.Final proved - complete 

Validity.Final.TCCl proved - complete 

Validity_Final_TCC2 proved - complete 

Validity .Corollary proved - complete 

Val idity _Corollary_TCC 1 proved - complete 

Validity_Corollary_TCC2 proved - complete 

Agreement .Final proved - complete 

Crash.Only.Validity proved - complete 

Crash.Only.Agreement proved - complete 

Crash.Only.Validity .Final proved - complete 

Cr ash.Only. Val idity .Final.TCCl proved - complete 

Crash.Only.Validity _Final_TCC2 proved - complete 

Crash.Only .Validity .Corollary proved — complete 

Cr ash.Only .Agreement.Final proved - complete 

Theory totals: 26 formulas, 26 attempted, 26 succeeded. 

The first example analyzed in detail is send5, which depends only on the axiom- 
atization of send and status. 

send5 has been PROVED . 

The proof chain for sendB is COMPLETE. 

send5 depends on the following axioms: 
omh . statuses.inclusive 
omh . send4 
omh . s end2 
omh. s end 1 

send5 depends on the following definitions: 
omh.s 
omh. c 
omh.g 
omh. a 

The second detailed analysis presented here is of the final validity theorem cor- 
responding to Lemma 1 in the semiformal proof described earlier. This theorem is 
essentially proved by appeal to the inductive version of validity, which has a long and 
complicated proof whose beginning was presented in Section 4.2.1. By transitivity of 
dependencies, this final version of validity depends on all the definitions and axioms 
that the inductive version of validity depends on, plus a few more. Note that this 
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proof also depends on several lemmas and definitions from the PVS prelude, such 
as the definitions of fincard and filter. Lemmas and definitions from the prelude are 
cited in these proof chain analyses, but the axioms of propositional logic, equality, 
the lambda calculus, and linear arithmetic used implicitly by the ground decision 
procedures are not identified in this way. 


Validity .Final has been PROVED. 

The proof chain for Validity_Final is COMPLETE. 

Validity.Final depends on the following proved theorems: 
bounded. induct ion. upto. induct ion 
omh. Validity 

f inite_cardinality .fincard.TCCl 
hybridmjrty . Hybrid.mjrty.TCCl 
omh. sends 

hybridmjrty . Hybrid_mjrty_TCC3 
omh. 0MH.TCC1 

f inite.cardinality .f incardi_TCC3 
card_set .fullset.f incard 
hybridmjrty . count _votes_TCC4 
f inite_cardinality .f incardi.TCCl 
hybridmjrty . count _votes_TCC3 
hybridmjrty . Hinv_holds 
omh.0MH.TCC2 

hybridmjrty . Hybrid.mjrty.TCC2 

omh . Validity. Final_TCC2 

hybridmjrty . Hlosers 

hybr idmj rty . Hwinner 

omh. IMPORTING 1.TCC1 

hybridmjrty . count. votes.TCCS 

omh.HybridMajority. 2 Lxl 

hybridmjrty . count .all.good_votes.TCC2 

identity . I.TCC2 

hybridmjrty . Hybrid.mjrty.TCC5 

hybr idmj rty . count .votes _TCC2 

hybridmjrty . count _2Lll.good_votes.TCC3 

omh . Validity.Prop.TCCl 

c 2 Lrd.set . remove.coram 

hybridmjrty . count .votes.TCCl 

f inite.cardinality .f incardi_TCC2 

card.set . remove.prop 

omh . Validity.Final.TCCl 

card.set . f incard.remove 

hybridmjrty . count _ 2 Lll_good_ votes.TCCl 

omh . f incaxd.all 
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f inite.cardinality . f incardi_TCC4 
omh . Hybr idMa j or it y _TCC 1 
card_set . f incar d.fil ter 
hybridmjrty . count_all_good_votes_TCC4 

Validity_Final depends on the following axioms: 
omh.send2 

omh. statuses_dis joint 
omh. sendl 

omh . statuses.inclusive 
omh. send 4 

Validity_Final depends on the following definitions: 
identity. I 
omh. c 

f inite.cardinality . f incard 

sets .fullset 

omh.gs 

filters .filter 
omh. a 

omh . Validity_Prop 
identity. identity 
hybridmjrty . Hinv 
omh. cs 

hybridmjrty .Hybr id.mjrty 
omh . HybridMa j or ity 
sets .member 

f inite.cardinality .f incardi 
hybridmjrty . count.votes 
omh.g 
omh. as 

hybridmjrty . count_all_good_votes 
omh . OMH 
omh. ss 
omh. s 

sets .remove 

Validity.Final depends on the following assumptions: 
omh. unact. ax 
omh. act _ ax 
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The third detailed analysis is that of the agreement property. The proof of this 
theorem depends on the inductive version of agreement, which in turn depends on 
the inductive validity property. 

Agreement.Final has been PROVED. 

The proof chain for Agreement .Final is COMPLETE. 


Agreement.Final depends on the following proved theorems: 
bounded.induct ion. upto. induct ion 
omh. Validity 

finite_c2Lrdinality.fincard.TCCl 
hybridmjrty . Hybr id.mjrty.TCCl 
omh. sendS 

hybridmjrty . Hybrid.mj rty_TCC3 
omh . 0MH.TCC1 

f inite.cardinality . f incardi_TCC3 
card. set . fulls et.f ineaxd 
hybridmjrty . count. votes.TCC4 
f inite.cardinality . f incardi.TCCl 
hybridmjrty . count. votes.TCC3 
hybridmjrty . count.votes.TCCl 
omh . 0MH.TCC2 

hybridmjrty. Eybrid.mjrty.TCC2 

hybridmjrty . Hlosers 

omh . Validity.Corollary.TCCl 

hybridmjrty . Hwinner 

omh . IMP0RTIHG1.TCC1 

hybridmjrty . count.votes.TCC5 

omh. HybridMajority.axl 

hybridmjrty . count.all_good_votes.TCC2 

omh. Agreement 

identity . I.TCC2 

omh . Agreement .Prop.TCCl 

hybridmjrty . count.votes_TCC2 

hybridmjrty . count _2Lll.good_votes.TCC3 

omh . Validity. Prop.TCCl 

caxd.set . remove.comm 

hybridmjrty . Hybrid.mj rty.TCCS 

hybridmjrty . Hinv.holds 

omh . Hybr idMaj ority_ax2 

caxd.set . f in caxd.non. empty 

omh . Validity.Corollary.TCC2 

finite_CcLrdinality.fincardi.TCC2 

caxd.set . remove.prop 

caxd.set . f incard.remove 
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hybridmjrty . count _all_good_votes_TCCl 

omh . f incard. all 

f init e.cardinality . f incardi_TCC4 
omh . HybridMa j or ity.TCCl 

card.set .f incard.f ilter 

hybridmjrty . count _all.good_votes.TCC4 

Agreement .Final depends on the following axioms . 
omh. send2 

omh. statuses.dis joint 
omh. sendl 

omh . st atuses.inclusive 
omh. send4 

Agreement .Final depends on the following definitions: 
identity . I 
omh. c 

f init e.cardinality . f incard 
sets .fullset 
omh.gs 
omh. a 

omh . Validity. Prop 
identity . identity 
hybridmjrty .Hinv 

omh. cs 

hybridmjrty . Hybrid.mjrty 
omh . HybridMa j ority 
sets .member 

f inite.cardinality .f incardi 
hybridmjrty . count.votes 
omh.g 

omh . Agreement. Prop 
omh. as 

filters.! ilter 

hybridmjrty . count .all.good.votes 
omh . OMH 
omh. ss 
omh. s 

sets .remove 

Agreement .Final depends on the following assumptions: 
omh.unact.ax 
omh . act. ax 

The complete detailed proof chain analysis of every lemma from the omh theory 
is over 20 pages in length. 



Chapter 5 

Conclusions 


Tools for formal verification have matured to the point where complex, practically 
interesting aspects of systems can be economically verified. The human effort re- 
quired to specify and prove in complete formal detail interesting theorems about 
fault- tolerant architectures is quite modest. In this report we have presented the 
formal verification of a new algorithm for Byzantine Agreement under a hybrid fault 
model. 

Thambidurai and Park’s hybrid fault model extends the design and analysis 
of Byzantine fault-tolerant algorithms in an important and useful way. Hybrid 
fault-tolerant algorithms can tolerate greater numbers of “simple” faults than clas- 
sical Byzantine fault- tolerant algorithms, without sacrificing the ability to with- 
stand Byzantine, or arbitrary, faults. We applied our formal verification tools to 
this domain, discovering errors in published proofs and in a proposed algorithm for 
Byzantine Agreement under this fault model. 

A crucial tool in our detection of the flaw in Thambidurai and Park’s algorithm, 
and also in detecting flaws in our own early attempts to repair this algorithm, was 
our use of mechanically-checked formal verification. The discipline of formal speci- 
fication and verification was also instrumental in helping us to develop the correct 
algorithm presented here. The rigor of a mechanically-checked proof enhances our 
conviction that this algorithm is, indeed, correct, and also helped us develop the 
informal, but detailed, proof given here in the style of a traditional mathematical 
presentation. 

It is worth repeating that no formal verification proves any program “correct.” 
At most, a model of the program is shown to satisfy a specification, and shown 
to exhibit certain properties under a certain set of assumptions. The true benefit 
of formal specification and verification is not in getting a theorem prover to say 
proved, but rather in refining one’s understanding through dialogue with a tireless 
mechanical skeptic. 
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Chapter 5. Conclusions 


The effort required to perform this formal verification was not particularly large 
and did not seem to us to demand special skill. We attribute some of this ease in 
performing formal verification of a relatively tricky algorithm to the effectiveness of 
the tools employed [22]. These tools (and others that may be of similar effective- 
ness) are freely available. In light of the flaws we discovered in Thambidurai and 
Park’s algorithm, and had previously found in the proofs for other fault-tolerant 
algorithms [24,29,30], we suggest that formal verification should become a routine 
part of the social process of development and analysis of fault-tolerant algorithms 
intended for practical application in safety-critical systems. 

In future work, we hope to explore possible extensions to the OMH algorithm 
and its analysis to include communication faults, and to see whether larger numbers 
of symmetric faults can be tolerated. We also intend to study whether lower message 
complexity can be achieved in cases of practical interest, and to examine alternative 
architectures employing fewer processors (we have already formally specified and 
verified a variant of OMH(l) for the asymmetric Draper FTP architecture [13]). 
We also plan to formally verify a modified version of the Interactive- Convergence 
Algorithm for clock synchronization using a hybrid fault model that includes com- 
munication faults (we have already formally verified the standard algorithm [28], 
and have an informal analysis of the modified version). 

Also, although our experience indicates that formal verification is an effective 
debugging technique, it is undeniably expensive one, and it is interesting to ask 
whether other methods could have identified the flaws in Z and its derivatives more 
simply or economically. Our previous experience with other mechanically- checked 
verifications is consistent with the experience reported here: the effort spent discov- 
ering and repairing flaws in a specification, algorithm, or proof is a large part of the 
intellectual effort expended on formal verification projects. 

An obvious alternative is testing: our specifications of these algorithms can be 
easily translated into Lisp or other higher-order functional languages where they can 
be run on a variety of test cases. (Obviously, controlled “fault-injections” will need 
to be programmed into the executable algorithms.) Without specifying a particular 
strategy for determining test cases, we cannot say whether specific flaws could have 
been detected in this way or not. However, it is clear that with less than complete 
test coverage, one cannot guarantee that all errors will be discovered. 

Between testing and conventional verification lie the state-exploration methods. 
These methods resemble testing in that they are automatic; they resemble veri- 
fication in that they are formal verification methods. State-exploration methods 
systematically enumerate all the states of a finite-state algorithm and test whether 
certain predicates hold at those states. Recent techniques allow large numbers of 
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states to be handled in an efficient manner. 1 Several systems based on state ex- 
ploration are available; some of these exploit the close connection between finite 
state graphs and propositional temporal logic (when they are usually called “model 
checkers” [8]), others provide a higher-level language (e.g., Mur<£ [10, 19] uses a 
transition-rule language for concurrent systems that is loosely based on Chandy 
and Misra’s Unity model [7]). 

As it stands, OMH is not amenable to state exploration: it has far too many 
states. But for debugging, it could be sufficient to examine highly simplified versions 
of the problem: for example, the case m = 1, n < 6, and a very small set of data 
values — E, R(E), and three distinct “good” values — seem sufficient to detect all the 
errors that we discovered. 

Whereas conventional testing probes selected test cases of the full algorithm, 
state exploration provides complete coverage of simplified instances. We plan to 
examine the effectiveness of state exploration in this domain by conducting some 
experiments with OMH and related algorithms. 


'These techniques include hashing [12], and symbolic 
grams [5,6]. 


methods using Binary Decision Dia- 
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Appendix A 

Hybrid MJRTY 


In the proofs performed to verify the correctness of OMH, only two properties about 
HybridMajority are assumed. Any implementation of HybridMajority that satisfies 
these axioms would be acceptable for the purposes of OMH. 


|gs(caucus)| > |as(caucus)| + |ss(caucus)| 

A (V p : g(p) A p £ caucus 3 v{p) = t) 

A t ^ error A (V p : c(p) A p € caucus 3 v(p) — error) 
3 HybridMajority(caucus, v) = i 


(V p : p G caucus 3 t*i(p) = v 2 (p)) 

3 HybridMajority(caucus, vi) = HybridMajority(caucus, t> 2 ) 


These properties were described and justified in detail in Section 4.1. Nonethe- 
less, there is always a concern when properties are stated axiomatically that they 
might be unrealizable. We allay this concern by exhibiting an efficient implementa- 
tion for the HybridMajority function and proving that it satisfies the stated axioms. 

This development is specified in the PVS theory hybridmjrty (shown on page 62), 
which constructively specifies a function Hybrid.mjrty . and then shows that it can be 
used to satisfy the axioms stated above. The function Hybrid.mjrty is very similar to 
the Boyer-Moore MJRTY algorithm [3], except that it ignores error values. MJRTY 
is a method for finding the absolute majority (if there is one) of a set of values in 
linear time, using only equality comparison operations. (Other implementations of a 
majority function require more than linear time, and/or more complex comparisons). 
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The original MJRTY algorithm has been formally specified and verified before [3,4], 
The modified version below is based on a recent unpublished verification of MJRTY 
by Natarajan Shankar. 

MJRTY can be informally explained as a method to find the candidate, if there 
is one, with the majority of votes at a political convention. Every voter on the 
convention floor attempts to find someone who is voting for some other candidate. 
If two voters voting for different candidates meet, they annihilate each other. At 
the end of this process, all the remaining voters must support the same candidate. 
The key property of this procedure is that if there is a majority for some candidate 
in the beginning, then there will be some voters for that candidate left at the end. 
If there is no candidate with a majority, then there may or may not be any voters 
left at the end, and those left could be voting for any candidate, even the one with 
the fewest total votes. Thus, a second (linear) pass is necessary to ensure that the 
delegate supported by the voters remaining at the end does indeed have an overall 
majority. 

This violently parallel procedure can be sequentialized by ordering the voters, 
and then moving down the line of voters forming a “bandwagon” of like-minded 
voters. When this bandwagon finds that the next voter agrees with them, the 
bandwagon simply increases in size by one. When it finds that the next voter 
choses some other candidate, the bandwagon’s size is decreased by one. If the 
bandwagon becomes empty, then the next voter becomes a bandwagon of one, and 
the process continues. This procedure can be implemented by storing the candidate 
chosen by the current bandwagon, and a natural number representing the size of 
the bandwagon. 

It is straightforward to generalize this procedure to a version which respects 
caucuses and ignores errors, as required by OMH. The input to the generalized 
procedure is a set of votes possibly containing votes belonging to processors outside 
the caucus, or in error, which should both be ignored in selecting the majority. It 
would be possible to use the standard MJRTY function on a set of votes filtered to 
remove error and noncaucus votes, although such a specification would be somewhat 
farther from the most efficient implementation, requiring additional passes over the 
set of votes. The generalized procedure must also specify the value to return in the 
case that there are no nonerror votes in the caucus. 

The sequential algorithm is represented in our specification as the function Hy- 
brid.mjrty, which takes a caucus (a set of eligible voters), a poll (a function from all 
voters to their chosen candidate), and i (the total number of votes) as arguments 
and returns a pair consisting of the identity of a candidate, and a natural number 
standing for the size of the bandwagon of voters remaining at the end. In the base 
case, the default value is chosen to be the error value. The function count-votes 
counts the votes of a particular candidate, while count-all-good-votes counts the 
total number of nonerror votes in a caucus. 
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Appendix A. Hybrid MJRTY 


Hlosers is a lemma about Hybrid.mjrty stating that for all candidates A, if 
A is not the first element of the pair returned by Hybrid.mjrty , then A does not 
have a majority of good votes. Hwinner is closely related to Hlosers , stating that 
if a candidate has a majority of votes, then that candidate will be selected by 

Hybrid.mjrty. 

The properties Hlosers and Hwinner follow directly from the inductive invariant 
Hinv. The property Hinv compactly asserts two invariants at once. First, that 
twice the total of the selected candidates votes is less than the size of the current 
bandwagon plus the total number of good (nonerror, in caucus) votes. Second, 
that twice the total of the selected candidates votes plus the size of the current 
bandwagon is less than the total number of good votes. These invariants are true 
for all nonerror candidates. Hinv is proven by induction on the total number of 

votes. 


hybridmjrty[T : TYPE, n : posnat, error : T\ : THEORY 
BEGIN 

poll : VAR [below[n] — ► T] 
caucus : VAR setof[below[n]] 
i : VAR upto[n] 
j4,cand : var T 
P : VAR [T, below [n]] 

H ybridjnjrty (caucus, poll, ») : RECURSIVE [T, nat] = 

(if i = 0 THEN (error, 0) 

ELSE 

(LET P = Hybridjnjrty(caucus, poll, i - 1) IN 

(IF (-• caucus(* - 1)) V poll(* - 1) = error THEN P 
ELSIF poll(i - 1) = proj.l(P) THEN (proj.l(P), projJ2(P) + 1) 
ELSIF proj_2(P) > 0 THEN (proj.l(P), proj -2(P) - 1) 

ELSE (poll(* - 1), 1) 

ENDIF)) 

endif) 

MEASURE (A caucus, poll, i : i) 
count.votes(caucus,poll, cand,t) : RECURSIVE nat = 

(if i > 0 

THEN 

(IF caucus(i — 1) A poll(: — 1) — cand THEN 1 
+ count_votes(caucus, poll, cand, i - 1) 

ELSE count .votes (caucus, poll, cand, i — 1) 
endif) 

ELSE 0 
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endif) 

MEASURE (A caucus, poll, cand, i : i) 

count_all_good_votes(caucus, poll, i) : RECURSIVE nat = 

(if i > 0 

THEN 

(if caucus(i — 1) A (-i poll (2 — 1) = error) 

THEN 1 + count_all,good_votes(caucus, poll, i — 1) 
else count_all_good_votes(caucus, poll, i — 1) 
endif) 

ELSE 0 
endif) 

MEASURE (A caucus, poll, i : i) 

Hinv(caucus, poll, i) : 
bool 

(let P = Hybrid_mjrty(caucus, poll, *) IN 

(V A : 

(-» A = error) 

3 2 
x 

(count_votes(caucus, poll, A, i) 

+ (if A = proj_l(P) then 0 else proj_2(P) endif)) 
< proj_2(P) -f count_all-good_votes(caucus, poll, *))) 

Hinv_holds : lemma Hinv(caucus, poll, t) 

Hlosers : 

LEMMA 

A ^ proj_l(Hybrid_mjrty(caucus,poll, i)) A (-> A = error) 

3 2 x count_votes(caucus, poll, A, *) 

< count_all_good_votes(caucus, poll, i) 

Hwinner : 

LEMMA 
(V cand : 

(2 x count_votes(caucus, poll, cand, i) 

> count_all^good_votes(caucus, poll, i) 

A -i cand = error) 

3 proj_l(Hybrid_mjrty(caucus, poll, i)) = cand) 


END hybridmjrty 
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Appendix A. Hybrid MJRTY 


The proof of TCC’s and lemmas from the hybridmjrty theory have been com- 
pleted in PVS. Informally, there is only one significant lemma, HinvJiolds, which is 
proved by induction on t. In the base case, this holds by definition of the functions. 
In the inductive case, there are many cases, but each one is relatively straightforward 
to analyze. The remaining two lemmas are immediately provable from HinvJiolds. 
The following is the proof chain analysis from PVS. 


Proof summary for theory hybridmjrty 


Hybrid.mjrty.TCC5 proved 

Hybrid.mjrty.TCCl proved 

Hybrid.mjrty.TCC2 proved 

Hybrid.mjrty.TCC3 proved 

count.votes.TCCl proved 

count _votes_TCC2 proved 

count.votes.TCC3 proved 

count.votes.TCC4 proved 

count, votes JTCC5 proved 

count.all_good_votes.TCCl proved 

count. all_good.votes.TCC2 * proved 

count _all.good_votes.TCC3 proved 

count_all.good_votes.TCC4 proved 

Hinv .holds proved 

Hlosers proved 

H winner proved 


Theory totals: 16 formulas, 16 attempted, 16 succeeded. 


complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 

complete 


Hinv.holds has been PROVED. 

The proof chain for Hinv.holds is COMPLETE. 

Hinv.holds depends on the following proved theorems: 
hybridmjrty . Hybrid_mjrty_TCC3 
hybridmjrty . count _all_good_votes.TCCl 
hybridmjrty . count .vote s_TCC4 
hybridmjrty . count. all_good_votes_TCC2 
hybridmjrty . count. all.good_votes.TCC4 
hybridmjrty . count. all_good_votes_TCC3 
hybridmjrty . count _votes_TCC5 
hybridmjrty . Hybrid.mjrty.TCC5 
bounded. induct i on . upt o . induct ion 
hybridmjrty . count.votes.TCC3 
hybridmjrty . count.votes.TCCl 
hybridmjrty .Hybrid.mjrty.TCCl 
hybridmjrty . count.votes.TCC2 
hybridmjrty . Hybrid.mjrty.TCC2 


Hinv.holds depends on the following definitions: 
hybridmjrty . count .votes 
hybridmjrty . Hybr id.m j rty 
hybridmjrty . count .all.good.votes 
hybridm j rty . Hinv 


Hlosers has been PROVED. 

The proof chain for Hlosers is COMPLETE. 

Hlosers depends on the following proved theorems 
hybridmjrty .Hybrid.mj rty_TCC3 
hybridmjrty . count.all_good_votes.TCCl 
hybridmjrty. count _votes_TCC4 
hybridmjrty . count_all_good_votes_TCC2 
hybridmjrty . count _all.good_votes.TCC4 
hybridmjrty . count _all.good_votes.TCC3 
hybridmjrty . count _votes_TCCS 
hybridm j rty . Hybr id.m jrty_TCC5 
bounded. induct ion . upto. induct ion 
hybridmjrty . count _votes_TCC3 
hybridmjrty . Hinv.holds 
hybridmjrty . count.votes.TCCl 
hybridmjrty .Hybrid.mj rty.TCCl 
hybridmjrty . count _votes_TCC2 
hybridm j rty . Hybrid_mjrty_TCC2 

Hlosers depends on the following definitions: 
hybridmjrty . count .votes 
hybr idm j rty . Hybr id.m j rty 
hybridmjrty . count.all.good.votes 
hybridmjrty . Hinv 


Hwinner has been PROVED. 

The proof chain for Hwinner is COMPLETE. 

Hwinner depends on the following proved theorems 
hybridmjrty .Hybr id.m jrty.TCC3 
hybridmjrty . count_all_good.votes.TCCl 
hybridmjrty . Hlosers 
hybridmjrty . count. votes_TCC4 
hybridmj rty . count. all.good_votes.TCC2 
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hybridmjrty . count. all.good.votes.TCC4 
hybridmjrty . count.all_good_votes.TCC3 
hybridmjrty . count_votes_TCC5 
hybridmjrty . Hybrid.mjrty.TCC5 
bounded. induct ion . upto. induct ion 
hybridmjrty . count.votes.TCC3 
hybridmjrty . Hinv.holds 
hybridmjrty . count .votes.TCCl 
hybridmjrty . Hybrid.mjrty.TCCl 
hybridmjrty . count _votes_TCC2 
hybridmjrty . Hybrid.mjrty.TCC2 

Hwinner depends on the tollowing definitions: 
hybridmjrty . count.votes 
hybridmjrty . Hybrid.mjrty 
hybridmjrty . count. all.good.votes 
hybridmjrty . Hinv 


A.l Using Hybrid_mjrty 

Given hybrid.mjrty , the following definition is sufficient to satisfy the axioms for 
HybridMajority stated earlier in the theory omh: 


HybridMajority(caucus,v) : T = proj_l(Hybrid_mjrty(caucus,t; I n)) 


The proof of satisfaction of the second axiom given for Hybridmajority from 
Hlosers and Hwinner is a relatively straightforward induction on the number of 
votes. The proof of satisfaction of the first axiom from Hwinner is much more in- 
tricate. The proof proceeds by effectively introducing two lemmas from which the 
proof of the desired property is relatively straightforward. The lemmas are intro- 
duced by the PVS proof command case, which splits the proof into two branches: 
on one branch the lemma can be assumed, on the other it must be proved. The first 
lemma states that if all good processors in a caucus agree on a candidate (value), 
then count.votes returns a value at least as big as the number of good processors. 


(V (p : feu) : g(p) A p € caucus D v(p) = t) 

D |<7$(caucus)| < count.votes(caucus, v, t, n) 




A.l. Using Hybrid-mjrty 
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The other lemma states that if all good processors vote for a nonerror value, and 
all manifest-faulty processors vote for an error value, then the sum of the cardinalities 
of good, arbitrary, and symmetric faulty processors is at least as large as the value 
returned by count.alLgood^votes . 


(V ( p : feu) : g(p ) A p E caucus D v(p) = t) 

A (V (p ; feu) : c(p) A p E caucus D t;(p) = error) 

D 

|ps(caucus)| + |as(caucus)| + |ss(caucus)| > count ^all-good _votes( caucus, v, n) 
V t = error 


Both lemmas are proven by induction on the number of votes. These lemmas do 
not appear in the formal specification, since they are only used internally in proving 
the property Hybridmajority.axl . In fact with subtle proof manipulation the use of 
these lemmas could be eliminated from the proof, although doing so would lengthen 
the proof. From these lemmas and Hwinner the property Hybridmajority.axl fol- 
lows directly. 
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